# My Data Science Blogs

## January 20, 2017

### Book Memo: “Python Data Science Handbook”

Essential Tools for Working with Data

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all: IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you’ll learn how to use:

• IPython and Jupyter: provide computational environments for data scientists using Python
• NumPy: includes the ndarray for efficient storage and manipulation of dense data arrays in Python
• Pandas: features the DataFrame for efficient storage and manipulation of labeled/columnar data in Python
• Matplotlib: includes capabilities for a flexible range of data visualizations in Python
• Scikit-Learn: for efficient and clean Python implementations of the most important and established machine learning algorithms

### Machine Learning Madden NFL: The best player position switches for Madden 17

A couple weeks ago, I wrote about my initial efforts toward using machine learning to model the “master equations” that govern the Madden NFL player ratings system. This week, I’d like to put those models to use to compute the

### Magister Dixit

“A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use.” William S. Cleveland ( {2000-2014} )

### Because it's Friday: Infrastructure Collapses

On November 7, 1940, the Tacoma Narrows Bridge, opened just four months prior, suffered catastrophic collapse in a windstorm. (The music in the video is annoying, so here's a version with an alternate soundtrack.)

The story behind the collapse is interesting: while it looks like a build-up of resonance is the culprit, it was actually the fluttering of the bridge deck that brought it down.

That's all from us for this week — we'll be back on Monday.

### We’re hiring!

I’m proud and humbled to announce that my company, Heraflux Technologies, is hiring for a Data Platform Solutions Architect!

We are looking for a highly qualified technologist who is comfortable with database technologies, such as Microsoft SQL Server and other DBMS platforms, and infrastructure technologies, such as virtualization, converged platforms, public cloud, and system administration.

The Solutions Architect is accountable for working with a number of organizations in a variety of industries to improve the availability, performance, and efficiency of the infrastructure stack underneath the application.

This position is more of a SQL Server focused role, and only senior-level SQL Server administrators should apply. More details are available here, and we look forward to hearing from you!

### Free guide to text mining with R

Julia Silge and David Robinson are both dab hands at using R to analyze text, from tracking the happiness (or otherwise) of Jane Austen characters, to identifying whether Trump's tweets came from him or a staffer. If you too would like to be able to make statistical sense of masses of (possibly messy) text data, check out their book Tidy Text Mining with R, available free online and soon to be published by O'Reilly.

The book builds on the tidytext package (to which Gabriela De Queiroz also contributed) and describes how to handle and analyze text data. The "tidy text" of the title refers to a standardized way of handling text data, as a simple table with one term per row (where a "term" may be a word, collection of words, or sentence, depending on the application).  Julia gave several examples of tidy text in her recent talk at the RStudio conference:

Once you have text data in this "tidy" format, you can apply a vast range of statistical tools to it, by assigning data values to the terms. For example, you can use sentiment analysis tools to quantify terms by their emotional content, and analyze that. You can compare rates of term usage, such as between chapters or between authors, or simply create a word cloud of terms used. You could use topic modeling techniques to classify a collection of documents into like kinds.

There is a wealth of data sources you can use to apply these techniques: documents, emails, text messages ... anything with human-readable text. The book includes examples of analyzing works of literature (check out the janeaustenr and gutenbergr packages), downloading Tweets and Usenet posts, and even shows how to use metadata (in this case, from NASA) as the subject of a text analysis. But it's just as likely you have data of your own to try tidy text mining with, so check out Tidy Text Mining with R to get started.

### How quickly can you remove spaces from a string?

Sometimes programmers want to prune out characters from a string of characters. For example, maybe you want to remove all line-ending characters from a piece of text.

Let me consider the problem where I want to remove all spaces (‘ ‘) and linefeed characters (‘\n’ and ‘\r’).

How would you do it efficiently?

    size_t despace(char * bytes, size_t howmany) {
      size_t pos = 0;
      for(size_t i = 0; i < howmany; i++) {
        char c = bytes[i];
        if (c == '\r' || c == '\n' || c == ' ') {
          continue;
        }
        bytes[pos++] = c;
      }
      return pos;
    }


This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.

That’s simple and should be fast… I had a blast looking at how various compilers process this code. It ends up being a handful of instructions per processed byte.
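As a quick usage sketch: the routine compacts the buffer in place and returns the new length, but it does not write a terminator, so a caller working with C strings has to add one. The wrapper name `despace_cstring` below is hypothetical and only there for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Scalar routine from the text: compacts in place, returns the new length. */
size_t despace(char *bytes, size_t howmany) {
    size_t pos = 0;
    for (size_t i = 0; i < howmany; i++) {
        char c = bytes[i];
        if (c == '\r' || c == '\n' || c == ' ') {
            continue; /* skip white space */
        }
        bytes[pos++] = c; /* pos never runs ahead of i, so this is safe */
    }
    return pos;
}

/* Hypothetical convenience wrapper for NUL-terminated strings. */
size_t despace_cstring(char *s) {
    size_t n = despace(s, strlen(s));
    s[n] = '\0'; /* despace itself does not terminate the result */
    return n;
}
```

Because the write position never overtakes the read position, no second buffer is needed.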

But we are processing bytes one by one while our processors have a 64-bit architecture. Can we process the data by units of 64-bit words?

There is a somewhat mysterious bit-twiddling expression that returns true whenever your word contains a zero byte:

    #define haszero(v) (((v)-UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))
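
For readers who want a little more than "it works", here is a rough sketch of the idea. Subtracting 1 from every byte sets the high bit of a byte that was zero; the `& ~(v)` part discards bytes whose high bit was already set, and the final mask keeps only the high bits. Combined with an XOR against a word filled with one character, the same test finds that character. The helper name `hasbyte` is an assumption made for this example.

```c
#include <stdint.h>

/* Nonzero exactly when some byte of v is zero (expression from the text). */
#define haszero(v) \
    (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* Hypothetical helper: nonzero when some byte of word equals c. */
uint64_t hasbyte(uint64_t word, unsigned char c) {
    /* ~0 / 255 is 0x0101010101010101; multiplying copies c into every byte. */
    uint64_t pattern = ~UINT64_C(0) / 255 * (uint64_t)c;
    /* XOR turns every byte equal to c into a zero byte. */
    uint64_t x = word ^ pattern;
    return haszero(x);
}
```

The result is only reliable as a yes/no answer for the whole word: a borrow from a zero byte can also flag the byte above it, so the set bits should not be read as exact positions.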


All we need to know is that it works. With this tool, we can write a faster function…

    // each mask holds one target character repeated in all eight bytes
    uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
    uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
    uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
    uint64_t word;
    size_t i = 0, pos = 0;

    for (; i + 7 < howmany; i += 8) {
      memcpy(&word, bytes + i, sizeof(word));
      // a byte equal to the target character becomes zero after the XOR
      uint64_t xor1 = word ^ mask1;
      uint64_t xor2 = word ^ mask2;
      uint64_t xor3 = word ^ mask3;

      if (haszero(xor1) | haszero(xor2) | haszero(xor3)) {
        // check each of the eight bytes by hand?
      } else {
        memmove(bytes + pos, bytes + i, sizeof(word));
        pos += 8;
      }
    }


It is going to be faster as long as most blocks of eight characters do not contain any white space. When this occurs, we are basically copying 64-bit words one after the other, along with a moderately expensive check that our superscalar processors can do quickly.
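The fragment above leaves the slow path as a comment and does not handle the last few bytes. One way to fill in those gaps, as a minimal sketch rather than the benchmarked implementation, is to fall back to the byte-by-byte check for any block that contains white space and for the tail. The names `despace64` and `is_white` are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define haszero(v) \
    (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* True for the three characters the text removes. */
static int is_white(char c) {
    return c == '\r' || c == '\n' || c == ' ';
}

size_t despace64(char *bytes, size_t howmany) {
    const uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
    const uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
    const uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
    size_t pos = 0;
    size_t i = 0;
    uint64_t word;

    for (; i + 7 < howmany; i += 8) {
        memcpy(&word, bytes + i, sizeof(word)); /* safe unaligned load */
        uint64_t xor1 = word ^ mask1;
        uint64_t xor2 = word ^ mask2;
        uint64_t xor3 = word ^ mask3;
        if (haszero(xor1) | haszero(xor2) | haszero(xor3)) {
            /* Slow path: this block has white space, so filter byte by byte. */
            for (size_t k = 0; k < 8; k++) {
                char c = bytes[i + k];
                if (!is_white(c)) {
                    bytes[pos++] = c;
                }
            }
        } else {
            /* Fast path: copy all eight bytes; the ranges may overlap. */
            memmove(bytes + pos, bytes + i, sizeof(word));
            pos += 8;
        }
    }
    /* Tail: fewer than eight bytes remain. */
    for (; i < howmany; i++) {
        if (!is_white(bytes[i])) {
            bytes[pos++] = bytes[i];
        }
    }
    return pos;
}
```

The `memcpy` into a local word keeps the load well defined regardless of alignment, which is part of what makes this version portable.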

Can we do better? Sure! Ever since the Pentium 4 (in 2001), we have had 128-bit (SIMD) instructions.

Let us solve the same problem with these nifty 128-bit SSE instructions, using the (ugly?) Intel intrinsics…

    // assumed: despace_mask16 is a precomputed table of 65536 shuffle masks
    // (type __m128i), one per possible value of mask16; it is not shown here
    __m128i spaces = _mm_set1_epi8(' ');
    __m128i newline = _mm_set1_epi8('\n');
    __m128i carriage = _mm_set1_epi8('\r');
    for(; i + 15 < howmany; i+=16) {
      __m128i x = _mm_loadu_si128((const __m128i *) (bytes + i));
      __m128i xspaces = _mm_cmpeq_epi8(x,spaces);
      __m128i xnewline = _mm_cmpeq_epi8(x,newline);
      __m128i xcarriage = _mm_cmpeq_epi8(x,carriage);
      __m128i anywhite = _mm_or_si128(_mm_or_si128(xspaces,xnewline),xcarriage);
      int mask16 = _mm_movemask_epi8(anywhite); // contains 16 bits, 1 = is white
      if(mask16 == 0) {// no match!
        _mm_storeu_si128((__m128i *) (bytes + pos),x); // just recopy
        pos += 16;
      } else { // we need to permute the bytes
        x = _mm_shuffle_epi8(x, _mm_loadu_si128(despace_mask16 + mask16)); // kept bytes move to the front
        _mm_storeu_si128((__m128i *) (bytes + pos),x);
        pos += 16 - _mm_popcnt_u32(mask16); // popcount!
      }
    }


The code is fairly straightforward if you are familiar with SIMD instructions on Intel processors. I have made no effort to optimize it… so it is possible, even likely, that we could make it run faster. Unrolling the loop is a likely candidate for optimization.
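The "permute" branch needs a byte shuffle that moves the kept bytes to the front of the register, and the shuffle control depends on the value of mask16. The plain-C sketch below shows how one such control vector could be derived and models the shuffle in scalar code. It is an illustration under assumptions, not the table used in the benchmark; `build_shuffle` and `shuffle_bytes` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Build the 16 shuffle indices for one mask16 value, where bit k set means
   byte k is white space. Kept bytes come first, in their original order.
   Unused slots get 0x80, which makes a pshufb-style shuffle write zero. */
void build_shuffle(uint16_t mask16, uint8_t shuffle[16]) {
    size_t out = 0;
    for (uint8_t k = 0; k < 16; k++) {
        if (!((mask16 >> k) & 1)) {
            shuffle[out++] = k;
        }
    }
    while (out < 16) {
        shuffle[out++] = 0x80;
    }
}

/* Scalar model of the byte shuffle: each output byte is either zero (high
   bit of the control byte set) or the input byte selected by the low 4 bits. */
void shuffle_bytes(const uint8_t in[16], const uint8_t shuffle[16],
                   uint8_t result[16]) {
    for (size_t k = 0; k < 16; k++) {
        result[k] = (shuffle[k] & 0x80) ? 0 : in[shuffle[k] & 0x0F];
    }
}
```

Precomputing this for every possible 16-bit mask would give 65,536 entries of 16 bytes each, about 1 MB of table, which is the kind of space-for-speed trade-off the SIMD version makes.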

Let us see how fast it runs as is!

I designed a benchmark using a recent (Skylake) Intel processor over text entries where only a few characters are white space.

| Approach | Speed |
|---|---|
| regular code | 5.85 cycles/byte |
| using 64-bit words | 2.56 cycles/byte |
| SIMD (128 bits) code | 0.80 cycles/byte |
| memcpy | 0.08 cycles/byte |

So the vectorized code is over seven times faster. That’s pretty good. I am using 128-bit registers, so I load and save blocks of 16 bytes. It would be foolish to expect to go 16 times faster, but I was hoping to be 8 times faster… being 7 times faster is close enough.

Yet pruning a few spaces is 10 times slower than copying the data with memcpy. So maybe we can go even faster. How fast could we be?

One hint: Our Intel processors can actually process 256-bit registers (with AVX/AVX2 instructions), so it is possible I could go twice as fast. Sadly, 256-bit SIMD instructions on x64 processors work on two independent 128-bit lanes, which makes algorithmic design more painful.

My approach using 64-bit words is a bit disappointing, as it is only twice as fast… but it has the benefit of being entirely portable… and I am sure a dedicated programmer could make it even faster.

### How Employers Judge Data Science Projects

One of the more commonly used screening devices for data science is the portfolio project.  Applicants apply with a project that showcases a piece of data science they’ve accomplished.  At The Data Incubator, we run a free eight-week fellowship helping train and transition people with master's and PhD degrees for careers in data science.  One of the key components of the program is completing a capstone data science project to present to our (hundreds of) hiring employers.  In fact, a major part of the fellowship application process is proposing that very capstone project, with many successful candidates having projects that are substantially far along if not nearly completed.  Based on conversations with partners, here’s our sense of priorities for what makes a good project, ranked roughly in order of importance:

1. Completion: While their potential is important, projects are assessed primarily based on the success of analysis performed rather than the promise of future work.  Working in any industry is about getting things done quickly, not perfectly, and projects with many gaps, “I wish I had time for” caveats, or “future steps” suggest the applicant may not be able to get things done at work.
2. Practicality: High-impact problems of general interest are more interesting than theoretical discussions on academic research problems. If you solve the problem, will anyone care? Identifying interesting problems is half the challenge, especially for candidates leaving academia who must disprove an inherent “academic” bias.
3. Creativity: Employers are looking for creative, original thinkers who can either (1) identify new datasets or (2) find novel questions to ask about a dataset. Employers do not want to see the tenth generic presentation on Citibike (or Chicago Crime, NYC Taxi, BTS Flight Delay, Amazon Review, Zillow home price) data. Similarly, projects that explain a non-obvious thesis supported by concise plots are more compelling than ones that present obvious conclusions (e.g. “more riders use Citibike during the day than at night”).  Remember: even a well-trodden dataset like Citibike can have novel conclusions and even untapped data can have obvious ones.  While your project does not have to be completely original, you should Google around to see if your analysis has been done to death.  Employers are looking for data scientists who can find trends in the data that they do not know.
4. Challenge data: Real world data science is not about running a few machine learning algorithms on pre-cleaned, structured CSV files.  It’s often about munging, joining, and processing dirty, unstructured data.  Projects that use pre-cleaned datasets intended for machine learning (e.g. UCI or Kaggle data sets) are less impressive than projects that require pulling data from an API or scraping a webpage.
5. Size: All things being equal, analysis of larger datasets is more impressive than analysis of smaller ones.  Real world problems often involve working on large, multi-gigabyte (or terabyte) datasets, which pose significantly more of an engineering challenge than working with small data.  Employers value people who have demonstrated experience working with large data.
6. Engineering: All things being equal, candidates who can demonstrate the ability to use professional engineering tools like git and Heroku will be viewed more favorably. So much of data science is software engineering and savvy employers are looking for people who have the basic ability to do this. To get started, try following this git tutorial or these Heroku tutorials in your favorite language.  Put up your results on github or turn your presentation into a small Heroku app!

Obviously, no project will be perfect.  It’s hard to fulfill all of these criteria, and individual employers undoubtedly have other criteria that we have not mentioned or have a different prioritization.  But more often than not, the fellows who are hired first have projects that satisfy more of these criteria than not.  And lastly, if you’re looking for a data science job or to kick start your career as a data scientist, consider applying to our free eight-week fellowship.

### Learn how to Develop and Deploy a Gradient Boosting Machine Model

GBM is one of the hottest machine learning methods. Learn how to create a GBM using Scikit-Learn and Python and understand the steps required to transform features, train, and deploy a GBM.

### Analytics Strategies for the Internet of Things – Getting the most out of IoT Data

IoT data offers answers to a simple question: “Are things changing or staying the same?” There are new data streams generated each day, that make it possible to quantify the formerly unquantifiable. The Internet of Things (IoT) enables us to measure processes and react more quickly to ever-evolving conditions, not

The post Analytics Strategies for the Internet of Things – Getting the most out of IoT Data appeared first on Dataconomy.

### Ten Things to Try in 2017

2016 marked a zenith in the data science renaissance. In the wake of a series of articles and editorials declaiming the shortage of data analysts, the internet responded in force, exploding with blog posts, tutorials, and listicles aimed at launching the beginner into the world of data science. And yet, in spite of all the claims that this language or that library make up the essential know-how of a "real" data scientist, if 2016 has taught us anything it's that the only essential skill is a willingness to keep learning.

U.S. Chief Data Scientist DJ Patil famously referred to data science as “a team sport” — and within an organization, data science does work best when practiced collaboratively. But the emerging field of data science is more organic and mutable than it is systematic and coordinated. Data scientists must continue learning new domains, languages, techniques, and applications so as to move forward as the field continues to evolve. In the age of the data product, a more apt analogy for data science is the amoeba — an organism continually in motion, altering its shape, spreading and changing. For this reason, it's probably more difficult to stay a data scientist than it is to become one.

For those of us who spent 2016 reading all those articles ("50 essential things every data scientist MUST know" and "How to spot a FAKE data scientist"), taking Coursera and Codecademy courses, following Analytics Vidhya tutorials, competing in Kaggle competitions, and trolling Kirk Borne on Twitter, now is a good time to think about what comes next.

So now you're a data scientist, congrats; where do you go from here?

## Okay, you're a data scientist, now what?

In the spirit of New Year's resolutions, here's a list of 10 (technical) things for the intermediate data scientist to try in 2017, things you can do to push yourself forward, keep your edge, set yourself apart, and be a better data scientist by 2018.

### 1. Adopt repeatable, systematic processes

Let's assume you're an old hand at data analytics, but how systematic is your process? If the answer is "not very," you might want to consider establishing a more structured path to discovery. Not only can a more systematic approach make you more efficient, but it can also ensure that you consistently consider a broader range of techniques for each problem, including things like graph analytics and time dynamics.

### 2. Explore a new data type

Love CSV? So do we! CSVs are great — simple, compact, distributable, and gotta love those header rows. But getting overly comfortable with wrangling one particular type of data can be limiting. This coming year, why not expand your analytical range by trying out a new serialization format like JSON or XML? Work mainly with categorical data? Try experimenting with a time series analysis. Mostly use relational data? Try your hand at unstructured text or geospatial data like rasters.

### 3. Break out of your machine learning rut

Have you fallen into an algorithmic comfort zone? People like to argue about which machine learning model is the best, and everyone seems to have their favorite! Sure, picking a good model is important, but it's debatable whether a model can actually be 'good' devoid of the context of the domain, the hypothesis, the shape of the data, and the intended application. Fortunately, high-level Python libraries like Scikit-learn (also Tensorflow, Theano, NLTK, Gensim, and Spacy) provide APIs that make it easy to test and compare a host of models without additional data wrangling. In 2017, build breadth by exploring new models — honestly, it has become lazy not to!

### 4. Learn how your favorite models actually work

We're living in an age where any data scientist with a little Python know-how can use a library like Scikit-Learn to predict the future, but few can describe what's actually happening under the hood. Guess what? Clients and customers are becoming more discerning and demanding more interpretability. For the many self-taught machine learning practitioners out there, now's the time to learn how that algorithm you love so much actually works. For Scikit-Learn-users, check out the documentation to find a link to the paper used in the implementation for each algorithm. You can also check out some of our previous posts to learn how things like PCA, distributed representations, skipgram, and parameter tuning work in theory as well as practice.

### 5. Start using pipelines

The machine learning process often combines a series of transformers on raw data, transforming the data set each step of the way until it is passed to the fit method of a final estimator. A pipeline is a mechanism for sanely combining these steps: a step-by-step set of transformers that takes input data and transforms it, until finally passing it to an estimator at the end. Pipelines can be constructed using a named declarative syntax so that they're easy to modify and develop. If you're just getting started with pipelines, check out Zac Stewart's excellent post on the topic.

### 6. Build up your software engineering chops

Data scientists tend to leave a massive amount of technical debt in their wake. But what do you think will happen to those data scientists when all the good software engineers figure out how to do logistic regressions?
In 2017, boost your software engineering skills by pushing yourself to develop higher quality code, to build object-oriented, reusable methods, and to practice good habits like writing documentation and using exception handling to facilitate better communication with the team (and with future you!).

### 7. Learn a new programming language

Move over imposter syndrome and meet your know-it-all sibling, contempt culture! Now that you're a data scientist and have accumulated enough confidence to override the natural impulse toward self-doubt, there's a tendency to get a bit cocky. Don't! Stay humble by pushing yourself to learn a new programming language. Know Python? Try teaching yourself JavaScript or CSS. Know R? Branch out to learn Julia or master SQL.

### 8. Consider data security

What's the biggest security risk for a modern business? Hiring a data scientist! Why? It's because data scientists often unknowingly expose their companies to massive security vulnerabilities. Attackers are interested in all kinds of data, and as Will Voorhees says in Eat Your Vegetables, as data scientists we often mistakenly think we can rely on the magic information security elves to protect our precious data. In 2017, make an effort to learn about encryption, account separation, and temporary credentials, and for the love of Hilary Mason, stop committing your access tokens to GitHub.

### 9. Make your code go faster

Still have scripts running all night in the hopes you'll wake up to results to enjoy with your morning coffee? You're out of excuses; it's time to get on the MapReduce bandwagon and teach yourself Hadoop and Spark. Know what else will speed things up? More efficient code! One low-hanging fruit is mutable data structures. Sure, Pandas data frames are great, but did you ever wonder what makes those lookups, joins, and aggregations so easy? It's holding a bunch of data in memory all at the same time. In 2017, try switching to NumPy arrays and see what you think.

### 10. Contribute to an open source project

You may not realize it, but as a data scientist, you are already significantly involved in the open source community. Nearly every single one of the tools we use (Linux, Git, Python, R, Julia, Java, D3, Hadoop, Spark, PostgreSQL, MongoDB) is open source. We look to StackOverflow and StackExchange to find answers to our programming questions, grab code from blog posts, and pip install like there's no tomorrow. In 2017, consider giving back by making your own contribution to an open source project; it's not just for data science karma, it's also a way to build up your GitHub cred! A lot of the senior data scientists I know don't even look at candidates' resumes anymore; someone's GitHub portfolio and commit history often tell volumes more.

## Conclusion

In 2016, the world of analytics, machine learning, multiprocessing, and programming got a lot bigger. The result of the data science eruption has been a broader and more diverse community of colleagues, people who will meaningfully augment not only the quantity but the quality of the next generation of data products.

And yet, this expansion has also meant that the field of data science began to lose some of its mystique and cachet. As new practitioners flood the market, data scientist salaries have started to drop off, from highs in the $200K-range to ones topping out closer to $150K; as Barb Darrow signaled in her 2015 Fortune article, "Supply, meet demand. And bye-bye perks."

So how can you distinguish yourself in a landscape which may once have felt impenetrable, but has now started to feel routine? Whether you use ours or set your own, pick ten things you can do over the next year to keep your mind sharp and your skills current, and remember — when it comes to data science, nothing endures but change!

### Trump inauguration bingo

I reached a milestone of 100,000 LinkedIn followers this week. Here are some of my most popular recent posts.

### Eat Melon: A Deep Q Reinforcement Learning Demo in your browser

Check out the "Eat Melon" demo, a fun way to gain familiarity with the Deep Q Learning algorithm, right in your browser.

### The Data Science Puzzle, Revisited

The data science puzzle is re-examined through the relationship between several key concepts in the realm, and incorporates important updates and observations from the past year. The result is a modified explanatory graphic and rationale.

### Alternatives to jail for scientific fraud

Mark Tuttle pointed me to this article by Amy Ellis Nutt, who writes:

Since 2000, the number of U.S. academic fraud cases in science has risen dramatically. Five years ago, the journal Nature tallied the number of retractions in the previous decade and revealed they had shot up 10-fold. About half of the retractions were based on researcher misconduct, not just errors, it noted.

The U.S. Office of Research Integrity, which investigates alleged misconduct involving National Institutes of Health funding, has been far busier of late. Between 2009 and 2011, the office identified three cases with cause for action. Between 2012 and 2015, that number jumped to 36.

While criminal cases against scientists are rare, they are increasing. Jail time is even rarer, but not unheard of. Last July, Dong-Pyou Han, a former biomedical scientist at Iowa State University, pleaded guilty to two felony charges of making false statements to obtain NIH research grants and was sentenced to more than four years in prison.

Han admitted to falsifying the results of several vaccine experiments, in some cases spiking blood samples from rabbits with human HIV antibodies so that the animals appeared to develop an immunity to the virus.

“The court cannot get beyond the breach of the sacred trust in this kind of research,” District Judge James Gritzner said at the trial’s conclusion. “The seriousness of this offense is just stunning.”

In 2014, the Office of Research Integrity had imposed its own punishment. Although it could have issued a lifetime funding ban, it only barred Han from receiving federal dollars for three years.

Sen. Charles Grassley (R-Iowa) was outraged. “This seems like a very light penalty for a doctor who purposely tampered with a research trial and directly caused millions of taxpayer dollars to be wasted on fraudulent studies,” he wrote the agency. The result was a federal probe and Han’s eventual sentence.

I responded that I think community service would make more sense. Flogging seems like a possibility too. Jail seems so destructive.

I do agree with Sen. Grassley that a 3-year ban on federal dollars is not enough of a sanction in that case. Spiking blood samples is pretty much the worst thing you can do, when it comes to interfering with the scientific process. If spiking blood samples only gives you a 3-year ban, what does it take to get a 10-year ban? Do you have to be caught actually torturing the poor bunnies? And what would it take to get a lifetime ban? Spiking blood samples plus torture plus intentionally miscalculating p-values?

The point is, there should be some punishments more severe than the 3-year ban but more appropriate than prison, involving some sort of restitution. Maybe if you’re caught spiking blood samples you’d have to clean pipettes at the lab every Saturday and Sunday for the next 10 years? Or you’d have to check the p-value computations in every paper published in Psychological Science between the years of 2010 and 2015?

The post Alternatives to jail for scientific fraud appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Learning the structure of learning

If anything, there has been a flurry of effort in learning the structure of new learning architectures. Here is an ICLR2017 paper on the subject of meta learning and posters of the recent NIPS symposium on the topic.

Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc Le (Open Review is here)

Abstract: Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.

• Jürgen Schmidhuber, Introduction to Recurrent Neural Networks and Other Machines that Learn Algorithms
• Paul Werbos, Deep Learning in Recurrent Networks: From Basics To New Data on the Brain
• Li Deng, Three Cool Topics on RNN
• Risto Miikkulainen, Scaling Up Deep Learning through Neuroevolution
• Jason Weston, New Tasks and Architectures for Language Understanding and Dialogue with Memory
• Oriol Vinyals, Recurrent Nets Frontiers
• Mike Mozer, Neural Hawkes Process Memories
• Ilya Sutskever, Using a slow RL algorithm to learn a fast RL algorithm using recurrent neural networks (Arxiv)
• Marcus Hutter, Asymptotically fastest solver of all well-defined problems
• Nando de Freitas , Learning to Learn, to Program, to Explore and to Seek Knowledge (Video)
• Alex Graves, Differentiable Neural Computer
• Nal Kalchbrenner, Generative Modeling as Sequence Learning
• Panel Discussion Topic: The future of machines that learn algorithms, Panelists: Ilya Sutskever, Jürgen Schmidhuber, Li Deng, Paul Werbos, Risto Miikkulainen, Sepp Hochreiter, Moderator: Alex Graves

Posters of the recent NIPS2016 workshop


### The inauguration of Donald Trump

WELCOMED by some, dreaded by others, the day of Donald Trump’s inauguration as president of the United States has arrived. This is the 58th official inauguration, one of the world’s oldest ceremonies for passing the baton to a new head of state, and one with a rich tradition.

### Four short links: 20 January 2017

HPC Training, Hard Tech, Analytics Platform, and Government-Vended Surveillance Data

1. Extreme Computing Lectures -- from Argonne National Labs' Extreme Computing Training course. Scary-hard HPC for Dummies.
2. How To Build a Hard Tech Startup -- rings true to me. The forces of physics, biology, and Moore’s Law can bedevil you in unexpected ways. Building a winning product is already challenging enough; with hard tech startups, the obstacles are even greater.
3. Myria -- open source from the University of Washington. A scalable analytics-as-a-service platform based on relational algebra.
4. Cashing In On Dystopia -- the sale of personal data through an information black market that appears to be plugged into national police and government databases already. (via Glyn Moody)

### Catalog of visualization tools

There are a lot of visualization-related tools out there. Here’s a simple categorized collection of what’s available, with a focus on the free and open source stuff.

This site features a curated selection of data visualization tools meant to bridge the gap between programmers/statisticians and the general public by only highlighting free/freemium, responsive and relatively simple-to-learn technologies for displaying both basic and complex, multivariate datasets. It leans heavily toward open-source software and plugins, rather than enterprise, expensive B.I. solutions.

I found some broken links, and the descriptions need a little editing, but it’s a good place to start.

Also, if you’re just starting out with visualization, you might find all the resources a bit overwhelming. If that’s the case, don’t fret. You don’t have to learn how to use all of them. Let your desired outcomes guide you. Here’s what I use.

### Facebook researchers at SPSP 2017

This week I will be speaking at SPSP 2017, the Society for Personality and Social Psychology’s annual convention, about behavioral science in industry.  As I reflected on how I spend my time as an industry practitioner – traveling the world to understand the users or potential users of the products and services Facebook delivers to people around the globe – I realized there is one particular area where closer partnerships from industry and academia may be able to shine new light in social and personality psychology: that of cross-cultural psychology.

I will illustrate my perspective through a story about how social and personality psychology currently treats cultural differences in behavior as well as some thoughts on how industry and academia may partner in the future to change some underlying dynamics and gain a true global understanding of the world.

Let’s start with food; many foods have generally agreed-upon, established associations based on where they originated (or are perceived to have originated) or ethnic association.

If you think about fried chicken, many people in the US would automatically associate it as being from the South, just as they would culturally associate spaghetti as being from Italy.

What would you think if I told you about a meal where the spaghetti was in a sweet red sauce and was served with fried chicken? Would you believe me if I told you it was a meal typical of an Asian country?  Does that meal fit with your concept of “Asian food”?

I asked a sample of 233 participants from an online panel, representative of people in the USA, which of the following meals they thought was “Asian food”:

• Steak and eggs
• Shrimp and grits
• Fish and chips
• Fried chicken and spaghetti
• None of the above.

As you can see (Figure 1), the respondents overwhelmingly said that none of the meals were “Asian food” – but fried chicken and spaghetti is. This is not a trick question. This meal is squarely from the Philippines, an Asian country. The meal may not be immediately associated as Asian amongst the US population due to Western culinary stereotypes that tend to lump together countries we are more familiar with, such as China and Japan, while largely ignoring the rest of the varied and vast Asian continent. We researchers also fall into the trap of basing our conclusions about Asia on data from a sample of Asian countries that cannot reasonably generalize to the entire region.

If fried chicken and spaghetti is an Asian meal and none of us knew that, what else are we missing?

While Asia is a vast, diverse area of the globe, I could not remember ever seeing a published paper from social psychology that included research from the Philippines. So we checked four top journals: Journal of Personality and Social Psychology, Psychological Science, Journal of Experimental Social Psychology, and Journal of Cross Cultural Psychology, and found that in the past decade, there has only been approximately one paper published that included data collected in the Philippines. Why is that? The Philippines is an archipelago nation of about 100 million people in Southeast Asia. A high percentage of its population not only speaks English but also has regular personal interactions with Westerners due to jobs in tourism or business process outsourcing (i.e., call centers). Yet over the past ten years, in 4 top journals, the vast majority of the data collected in the East – the data for nearly 80% of the papers we examined – was only collected in China or Japan. One reason that data from China or Japan probably cannot speak to psyche and behavior in the East or Asia as a whole is that Asia is replete with countries whose cultures have been influenced by Western colonization, yet China and Japan were never colonized by the West.

Psychology’s undersampling is not unique to the Philippines; the same goes for Indonesia, another Southeast Asian archipelago nation. Over the past ten years, approximately one published study included data collected from Indonesia. I find that surprising given that Indonesia is the most populous Muslim country in the world, with over a quarter of a billion citizens. Perhaps the fact that it is a predominantly Muslim country is irrelevant. However, I believe Indonesia offers the potential to discover interesting social phenomena built upon Muslim values (similar to how US-focused research often finds roots for psychological phenomena in Judeo-Christian values). Indonesia is certainly very different culturally than China and Japan, so why aren’t we, as a field, doing research there?

The issue with cross-cultural psychology is not just that research is primarily conducted in two Asian countries. The concern is how we, as a field, generalize from the data we collect. In these top four journals over the past ten years, the overwhelming majority of papers included data collected from only one Asian country, yet about half of them made some sort of holistic East vs West generalization in their conclusions.

So where do we go from here? From here, we need to go everywhere. We can’t keep collecting data from two Asian countries and assume that we are bringing a global perspective to the field. Here are two thoughts:

1. Social and personality psychology as a field has an opportunity to develop a new paradigm: extremely brief (think 5 minutes or less) academic studies that do not rely on random or representative samples. Industry researchers often accept requests from each other to add very brief “tack on” studies to their longer research sessions; coopting this practice to propose brief academic studies that could be tacked onto existing industry research plans across the globe might create low-cost collaboration that could yield tremendous amounts of data for academics from countries that they do not typically have access to for data collection. The countries I have travelled to over the past 5 years are shown in Figure 2; multiply that across the number of practitioners who travel abroad for work and imagine how much of the world we could cover if we tacked on a little academic study to every business trip we took.
2. Partner with local researchers practicing outside academia. Whenever I conduct research outside the US, I partner with skilled local researchers. They often do not have the same backgrounds as academic psychologists, but that doesn’t detract from the valuable insights they have about the local population or their abilities to be valuable research collaborators.

SPSP 2017 brings together the top social and personality psychologists in the world, giving us the opportunity to present and discuss research, network and collaborate on projects, while advancing science and pedagogy in the field. I’m looking forward to working with my fellow psychologists in industry and academia to discuss together how we might all bring a truly global perspective to our work.

### Other Facebook Research being presented at SPSP:

Posters:
Does the Course of Premarital Courtship Predict Newlywed’s Subsequent Marital Outcomes?
By Grace Jackson, UCLA (now at Facebook), Thomas Bradbury, UCLA and Benjamin Karney, UCLA.

Income Inequality and Infrahumanizing Attitudes toward Political Outgroups, by Emily Becklund, UC Berkeley (now at Facebook) and Serena Chen, UC Berkeley.

Workshop:
Preparing Your Students for a Career in the Private Sector, by Liz Keneski, Erin Baker, Tim Loving, and Kelley Robinson all of Facebook.

### Forward Propagation: Building a Skip-Gram Net From the Ground Up

Editor's Note: This post is part of a series based on the research conducted in District Data Labs' NLP Research Lab. Make sure to check out the other posts in the series so far:

Let's continue our treatment of the Skip-gram model by walking through a single example of feeding forward through a Skip-gram neural network: from an input target word, through a projection layer, to an output context vector representing the target word's nearest neighbors. Before we get into our example, though, let's revisit some fundamentals of neural networks.

## Neural Network Fundamentals

Neural networks originally got their name from borrowing concepts observed in the functioning of the biological neural pathways in the brain. At a very basic level, there is a valid analogy between a node in a neural network and the neurons in a biological brain worth using to explain the fundamental concepts.

The biological model of a neural pathway consists of specialized cells called neurons. A neuron's dendrites receive chemical signals at one end, which build up an electric potential until a threshold is reached (called activation), resulting in more or different chemical signals being released at the other end. These output chemicals can then act as input to other neurons, causing them to activate as well. In the biological model, a combination of activated neurons can be interpreted in such a way as to cause an action on behalf of the organism.

The computational model of a neural network represents this process mathematically by propagating input data in a particular way through a graph structure containing nodes inside an input layer, hidden layer, and output layer. The input layer represents the input data, analogous to the incoming chemical signals of a neuron. The hidden layer represents the neurons receiving the input signal. The output layer is a simplification of the decision interpretation of the biological model; it represents the decision or action a given activated pathway indicates. In other words, the output layer is a classifier.

The input data, which in the biological system are chemical signals but in the computational model are numerical quantities, are made available to all potential hidden layer nodes through a linear transformation of a weight layer. This provides a way for the computational model to represent that not all nodes receive the input data evenly or with the same affinity.

The hidden layer, representing our neurons, receives this input and then applies a non-linear activation function to simulate neurons' activation. The simplest activation function is the Heaviside step function, which simply turns a hidden layer node on (1) or off (0) (a neural network that uses this type of activation function is nicknamed a perceptron). More common activation functions in language models are the logistic (sigmoid) function, the hyperbolic tangent, and their variants.
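To make those activation functions concrete, here is a minimal NumPy sketch of the Heaviside step, the logistic sigmoid, and the hyperbolic tangent. The function names and the sample values in `z` are my own choices for illustration, not something taken from a particular library or paper.

```python
import numpy as np

def heaviside_step(z):
    """Return 1.0 where z is positive and 0.0 elsewhere (node on or off)."""
    return np.where(z > 0, 1.0, 0.0)

def logistic(z):
    """Logistic (sigmoid) function, squashing values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values for three hidden nodes.
z = np.array([-2.0, 0.0, 2.0])

print(heaviside_step(z))  # hard on/off decision per node
print(logistic(z))        # smooth values between 0 and 1
print(np.tanh(z))         # smooth values between -1 and 1
```

The step function throws away how far a node is from its threshold, while the two smooth functions keep that information, which is part of why they are more common in practice.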

The output layer is produced by a linear transformation from the hidden layer through another weight matrix. This provides a way for the computational model to represent that activated nodes have varying effects on the final decision interpretation.

Below is a figure from the Appendix of the paper "word2vec Parameter Learning Explained" by Xin Rong that represents this process. The input layer {x(k)} = {x(1) ... x(K)} is linearly transformed through a weight matrix {w(ki)}. That input is activated with an activation function to produce the hidden layer {h(i)} = {h(1) ... h(N)}. The resulting hidden layer {h(i)} is linearly transformed through another weight matrix {w'(ij)} to finally result in the output layer {y(j)} = {y(1) ... y(M)}.
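The same flow can be sketched in a few lines of NumPy. This is only an illustration under assumptions of my own: the layer sizes K, N, and M, the random values, and the choice of tanh as the activation are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for repeatability

K, N, M = 4, 3, 2                # hypothetical layer sizes
x = rng.random(K)                # input layer {x(k)}
w = rng.random((K, N))           # first weight matrix {w(ki)}
w_prime = rng.random((N, M))     # second weight matrix {w'(ij)}

h = np.tanh(x @ w)               # hidden layer {h(i)}: linear transform, then activation
y = h @ w_prime                  # output layer {y(j)}: linear transform of the hidden layer

print(h.shape, y.shape)
```

Notice that the only non-linear step is the activation applied to produce `h`; everything else is a matrix product.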

The learning part of both a biological brain and a computational neural network has to do with strengthening or weakening certain paths given certain input data. The biological mechanism that achieves this is well beyond my ability to explain, but on the computational side we simulate it by modifying our weight layers in reaction to some objective function (a term often used interchangeably with the related term cost function; for our purposes the two are synonymous). Knowing what our input layer looks like and what our output layer should represent, we pick some objective function that allows us to compare our output layer classifier with the expected result for a given input. In other words, we pick something the output layer should represent (like a probability) that we can compare to the subset of reality represented by our training data (like an observed frequency).

By iteratively feeding in training examples and penalizing the weight layers for the error observed based on our interpretation of the output layer, the net eventually learns how to output the values that mean what we want. We use a mathematical technique called the chain rule to compute the effect of different parts of the network (e.g. only the second weight layer, or only the first weight layer) on the total error. We then modify each piece accordingly using an algorithmic technique called gradient descent (or ascent, depending on whether you are minimizing or maximizing your objective function). In practice, we use a less computationally expensive version called stochastic gradient descent, which updates the weights using the gradient computed on a single randomly chosen training example (or a small random batch) rather than on the entire training set.
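As a toy illustration of that update loop, here is a sketch of stochastic gradient descent learning a single weight. Everything in it is an assumption for the sake of the example: the data follow y = 3 * x, the objective is squared error, and the learning rate and epoch count are arbitrary. It is not the Skip-gram objective, just the smallest loop I could write that shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets follow y = 3 * x exactly.
xs = rng.random(100)
ys = 3.0 * xs

w = 0.0               # the single weight we want to learn
learning_rate = 0.1   # assumed step size

for epoch in range(200):
    for i in rng.permutation(len(xs)):    # visit examples in random order
        error = w * xs[i] - ys[i]         # prediction minus target
        gradient = 2.0 * error * xs[i]    # d(error**2)/dw via the chain rule
        w -= learning_rate * gradient     # step against the gradient

print(round(float(w), 3))
```

Each update looks at one example only, which is what makes the method cheap compared to computing the gradient over the whole training set before every step.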

Once you have reached a predefined threshold for your objective function (or got tired of running your network for so long), your model is complete. You would now be able to submit new input data and receive a classification result based on all the examples you trained with in the first place.

## Skip-gram Model Architecture

Now that we've reestablished some fundamentals, let's set up the specific architecture of Skip-gram. Recall that the goal of Skip-gram is to learn the probability distribution of words in our vocabulary being within a given distance (context) of an input word. We are effectively choosing our output layer to represent that probability distribution. Once we've trained the network with this goal, instead of using the model to predict the context probability distribution for future words, we'll derive distributed representations of words we saw (word embeddings) from the input weight layer of the final model.

As you may recall from our previous post, our example corpus consisted of the following sentence.

“Duct tape works anywhere. Duct tape is magic and should be worshiped.”

In that post, we preprocessed this sentence by removing punctuation and capitalization, dropping stopwords, and stemming the remaining words. This left us with "duct tape work anywher duct tape magic worship."

Let's start by defining some terminology and the variables you'll see associated with them throughout this post.

A vocabulary (v) is a deduplicated, ordered list of all the distinct words in our corpus. The specific order of the words doesn't matter, as long as it stays the same throughout the entire process.

>>> v = ["duct", "tape", "work", "anywher", "magic", "worship"]
>>> print(len(v))
6
>>> print(v[0])
duct
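If you would rather derive the vocabulary than type it out, here is a minimal sketch. It assumes the preprocessed corpus is available as a single space-separated string; the variable names `corpus` and `tokens` are mine.

```python
corpus = "duct tape work anywher duct tape magic worship"
tokens = corpus.split()

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dicts in Python 3.7+), so this deduplicates while
# preserving the order in which words first appear.
v = list(dict.fromkeys(tokens))

print(v)
print(len(v))
```

Using a plain `set` would also deduplicate, but it would not keep a stable order, and a stable order is the one thing the vocabulary needs.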


A context (c) is a zone, or window, of words before and after a target word that we want to consider "near" the target word. We can select a context of our choice, which is 2 in the example below.

>>> c = 2
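To see what a context of 2 means for our corpus, here is a small sketch that lists every (target word, offset, context word) combination. It treats the preprocessed corpus as one continuous sequence of tokens, which is an assumption on my part, and `context_pairs` is a hypothetical helper name.

```python
tokens = "duct tape work anywher duct tape magic worship".split()
c = 2

def context_pairs(tokens, c):
    """Yield (target, offset, context_word) for every in-bounds context position."""
    for t, target in enumerate(tokens):
        for offset in range(-c, c + 1):
            if offset == 0:
                continue                  # the target itself is not context
            j = t + offset
            if 0 <= j < len(tokens):      # skip positions that fall off either end
                yield target, offset, tokens[j]

pairs = list(context_pairs(tokens, c))
print(len(pairs))
print(pairs[:3])
```

Words near the start or end of the corpus have fewer than c*2 context words, which is why the total is smaller than 8 * 4.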


We have another selection to make, this time about the projection layer. We can specify the size of the projection layer in nodes (n). The higher the number of nodes, the higher dimensionality the projection between layers will be and the higher dimensionality your final word embeddings will be. We'll look at this in more detail a little later. For now, let's just remember that n is a tunable parameter, and let's set it to 3.

>>> n = 3


Before we actually get to working through the neural network, let's reconsider how we will provide each input word to it.

## Word Encodings

What do each of these words look like when they go into the neural network? To perform transformations on them through each layer, we have to represent the words numerically. Additionally, each word's representation must be unique. There are many vector-based approaches to represent words to a neural network, as discussed in the first post in this series. In the case of Skip-gram, each input word is represented as a one-hot encoded vector. This means that each word is represented by a vector of length len(v), where the index of the target word in v contains the value 1 and all other indices contain the value 0.

To illustrate, let's recall our vocabulary vector.

>>> print(v)
['duct', 'tape', 'work', 'anywher', 'magic', 'worship']


We would expect to represent each word in our vocabulary with the following one-hot encoded vectors:

"duct" [1,0,0,0,0,0]

"tape" [0,1,0,0,0,0]

"work" [0,0,1,0,0,0]

"anywher" [0,0,0,1,0,0]

"magic" [0,0,0,0,1,0]

"worship" [0,0,0,0,0,1]
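Here is one way to generate those vectors rather than writing them by hand. It is a minimal sketch; `one_hot` is a helper name I made up for this example.

```python
import numpy as np

v = ["duct", "tape", "work", "anywher", "magic", "worship"]

def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[vocabulary.index(word)] = 1
    return vector

for word in v:
    print(word, one_hot(word, v))
```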


This selection of one-hot encoded vectors, as the representation of words to the Skip-gram net, is actually going to become quite important later. But let's get to some forward propagation now.

## Birds-Eye View of Skip-Gram as a Neural Net

The Skip-gram neural network is a shallow neural network consisting of an input layer, a single linear projection layer, and an output layer. The input layer is a one-hot encoded vector representing the input word, and the output layer is a probability distribution of what words are likely to be seen in the input word's context. The objective function of the net attempts to maximize the output layer probability distribution against what is known from the source corpora about a given word's context frequency distribution.

That is quite dense, and several of these parts are important, but I want to point out a major divergence from the classic neural network model that we just described. Classic neural nets have hidden layers transformed with a non-linear activation function, as described in the neural net refresher above: Heaviside step functions, the logistic function, the hyperbolic tangent. Earlier neural language models were prohibitively limited by training time, and one of the major computational tradeoffs Mikolov et al made to make Skip-gram and CBOW feasible was to completely remove the activation step from the hidden layer. In the paper Efficient Estimation of Word Representations in Vector Space, they state:

"The main observation from [prior work] was that most of the complexity is caused by the nonlinear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently."

Therefore, what a classical neural net would term a 'hidden layer' is generally regarded instead as a projection layer for the purposes of Skip-gram and CBOW. In Mikolov et al, they distinguish these two models from classic, non-linearly activated neural nets by calling them log-linear models.

To train the net, we scan through our source corpus, submitting each word we encounter once for each of the c*2 output contexts. The input word is projected through a weight layer and then transformed through another weight layer into an output context. Each output vector has length len(v) and contains at each index a score for that index's word, estimating the word's likelihood of appearing in that context location.

Below is a reproduction of the architecture diagram from Google's introductory Skip-gram paper, Efficient Estimation of Word Representations in Vector Space.

Let's break that down again using the terms in the diagram. The Skip-gram neural net iterates through a series of words one at a time, as input w(t). Each input word w(t) is fed forward through the network c*2 times, once for each output context vector. Each time w(t) is fed through the network, it is linearly transformed through two weight matrices to an output layer that contains nodes representing a context location: where c=2, those context locations are w(t-2), w(t-1), w(t+1), and w(t+2). Each output vector is the length of the vocabulary and contains, at each index, a score estimating the likelihood that the corresponding vocabulary word would appear in that context position. For example, with the input word "tape", the value at index 2 of the w(t-2) vector (w(t-2)[2], counting from zero) would be a score for the likelihood that the word "work" would appear two words behind the input word "tape." Those raw scores are later turned into actual probabilities using the softmax equation.

For each given training instance, the net will calculate its error between the probability generated for each word in each context location and the observed reality of the words in the context of the training instance. For example, the net may calculate that "work" has a 70% chance of showing up two words before the word "tape", but we can determine from the source corpus that the probability is really 0%. Through the process of backpropagation, the net will modify the weight matrices to change how it projects the input layer through to the output layer in order to minimize its error: for example, to minimize the error between the calculated 70% and the observed 0%. Then the next word in the corpus will be sent as an input c*2 times, then the next, and so on.
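The "observed reality" side of that comparison can be computed straight from the corpus. The sketch below is an illustration under my own assumptions: it treats the preprocessed corpus as one token sequence, and `observed_distribution` is a hypothetical helper, not part of any library.

```python
from collections import Counter

tokens = "duct tape work anywher duct tape magic worship".split()
v = ["duct", "tape", "work", "anywher", "magic", "worship"]

def observed_distribution(target, offset, tokens, vocabulary):
    """Relative frequency of each vocabulary word at `offset` from `target`."""
    counts = Counter(
        tokens[t + offset]
        for t, word in enumerate(tokens)
        if word == target and 0 <= t + offset < len(tokens)
    )
    total = sum(counts.values())
    return [counts[word] / total if total else 0.0 for word in vocabulary]

# What actually appears two words before "tape" in our corpus?
observed = observed_distribution("tape", -2, tokens, v)
print(list(zip(v, observed)))
```

In our tiny corpus the only word that ever sits two places before "tape" is "anywher", so "work" gets an observed frequency of 0, matching the 0% in the example above.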

Once the net is done training, the first weight layer will contain the vector representations of each word in the vocabulary, in order. After we go through all of feed-forward and backprop, you'll see why, but for now while we've got the gist of the architecture, let's try plugging in some numbers.

## A Real Life Net

Time for some code! Let's get a more detailed, color-coded diagram going:

The figure above is an extended depiction of a Skip-gram network for a training example feeding forward to output context vectors representing a context window c=2. The input word is a one-hot encoded vector the size of the vocabulary. This is connected to a projection layer whose size is set by our node size parameter n. The input layer is projected via linear transformation through the input weight matrix p, of len(v) x n dimensions. With a context size of 2, each training example will feed forward 4 times to 4 output vectors: w(t-2), w(t-1), w(t+1) and w(t+2). The projection layer is connected to the output layers by the output weight matrix p', which has n x len(v) dimensions.

From left to right, each subsequent layer is constructed mathematically simply by taking the dot product of a layer and its weight matrix.

The weight matrices p and p' can be initialized in many ways, and will eventually be tuned using backpropagation. For now, let's initialize the net simply with random numbers. Let's expand our example with those randomized matrices and calculate the transformations from the input word to the projection layer and from the projection layer to the output layer. Let's consider a forward pass through the net where our input word is the second word in our vocabulary, "tape."

First let's import our packages and set up our first two pieces, the input array for "tape" (which you'll recall is one-hot encoded), and our randomized first weight layer p:

>>> import numpy as np
>>> np.random.seed(42)

>>> input_array_tape=np.array([0,1,0,0,0,0]) #"tape"

>>> input_weight_matrix = np.random.random_sample((6,3))
>>> print(input_weight_matrix)
[[ 0.37454012  0.95071431  0.73199394]
[ 0.59865848  0.15601864  0.15599452]
[ 0.05808361  0.86617615  0.60111501]
[ 0.70807258  0.02058449  0.96990985]
[ 0.83244264  0.21233911  0.18182497]
[ 0.18340451  0.30424224  0.52475643]]


Now, let's calculate from the input layer to the projection layer. You'll see here that the effect of a linear transformation from the one-hot encoded layer through the weight layer means we are simply projecting a row of the weight matrix matching the index of the one-hot encoded input word through to the next layer. That is why we keep calling this process 'projection' and are shying away from the terms 'hidden' or 'activation' layer (though those terms are still sometimes used to describe this process). You can compare the output vector below to the matrix above and see that clearly - we've simply picked out the vector from the second matrix row of p during this projection process.

>>> projection = np.dot(input_array_tape,input_weight_matrix)
>>> print(projection)
[ 0.59865848  0.15601864  0.15599452]
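You can check the row-selection claim directly. This sketch rebuilds the same seeded matrix and compares the dot product with a plain row lookup; `row_lookup` is a name I introduced for the comparison.

```python
import numpy as np

np.random.seed(42)
input_weight_matrix = np.random.random_sample((6, 3))
input_array_tape = np.array([0, 1, 0, 0, 0, 0])   # one-hot vector for "tape"

projection = np.dot(input_array_tape, input_weight_matrix)
row_lookup = input_weight_matrix[1]               # row at the index of "tape" in v

print(np.allclose(projection, row_lookup))
```

Because the two are equal, multiplying by a one-hot vector is really just indexing into the weight matrix.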


Next, let's randomize the output weight matrix p'.

>>> output_weight_matrix = np.random.random_sample((3,6))
>>> print(output_weight_matrix)
[[ 0.43194502  0.29122914  0.61185289  0.13949386  0.29214465  0.36636184]
[ 0.45606998  0.78517596  0.19967378  0.51423444  0.59241457  0.04645041]
[ 0.60754485  0.17052412  0.06505159  0.94888554  0.96563203  0.80839735]]


We'll use that to calculate from the projection layer to the first output context vector w(t-2).

>>> output_array_for_input_tape_and_orange_output_context = np.dot(projection, output_weight_matrix)
>>> print(output_array_for_input_tape_and_orange_output_context)
# [ 0.42451664  0.32344971  0.40759145  0.31176029  0.41795589  0.35267831]


At this point, we have the first output vector after we've performed calculations against the randomized weight layers p and p'. For the purposes of the diagram I have rounded all the values in our matrices to the tenths place, but let's take a look to re-establish where we are.

Now that we have this context vector, let's dive in again to what the context vectors really are. Each context vector is the length of the vocabulary, and since the vocabulary is an ordered list, each index in the context vector can be traced back to a certain word in the vocabulary. The value at each index in the context vector estimates how likely the vocabulary word at that index is to appear in that context position.

Let's unpack that some more with the following code snippets. The variable output_array_for_input_tape_and_orange_output_context is the output vector for that orange colored context w(t-2) we just calculated given the input word "tape". The value at each index of that vector represents each vocabulary word's estimated likelihood to be 2 words behind "tape", per the vocabulary order of our original vocabulary vector v.

>>> print(output_array_for_input_tape_and_orange_output_context)
# [ 0.42451664  0.32344971  0.40759145  0.31176029  0.41795589  0.35267831]


We know the vocabulary is made up of this ordered list.

>>> print(v)
['duct', 'tape', 'work', 'anywher', 'magic', 'worship']


If we zip those together, we annotate each index in the context vector that estimates the likelihood of being 2 words behind "tape" with what word in the vocabulary we're estimating for:

>>> print(list(zip(v, output_array_for_input_tape_and_orange_output_context)))
#[('duct', 0.42451663675598933),
#('tape', 0.32344971050993732),
#('work', 0.40759145057525981),
#('anywher', 0.3117602853605092),
#('magic', 0.41795589389125587),
#('worship', 0.35267831257488347)]


Since we fed this whole net the input word "tape," we can read from this output vector for w(t-2) that the word "duct" is estimated to be two words behind our input word "tape" with a likelihood score of 0.42451663675598933, that the word "tape" is estimated to be two words behind our input word "tape" with a likelihood score of 0.32344971050993732, and so on.
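If you want to read the scores from most to least likely, a quick sort does the job. This sketch copies the rounded scores printed earlier for the w(t-2) context of "tape"; the names `scores`, `order`, and `ranked` are mine.

```python
import numpy as np

v = ["duct", "tape", "work", "anywher", "magic", "worship"]
scores = np.array([0.42451664, 0.32344971, 0.40759145,
                   0.31176029, 0.41795589, 0.35267831])

order = np.argsort(scores)[::-1]          # indices from highest to lowest score
ranked = [(v[i], float(scores[i])) for i in order]

for word, score in ranked:
    print(f"{word:8s} {score:.4f}")
```

Keep in mind these rankings come from random weights, so at this stage they say nothing meaningful about the corpus.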

Intuitively, we can see that we have very similar scores right now for each word, even though we know from our corpus that each word is not equally likely to appear at position w(t-2), aka two spots before our target word. We will use backpropagation to calculate our error here against the known likelihood of these words in context and adjust the weight vectors to reduce our error. However, there is one more processing step to these output vectors before we're ready to have the net begin backpropagation: applying the softmax equation. The softmax will normalize the values in each context vector to represent probabilities, so we can directly compare them during backpropagation to the known frequency distributions in the source corpus. Softmax will be the topic of the second half of this feedforward post series, so stay tuned!
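As a small preview, here is a minimal sketch of the standard softmax function applied to the scores we just computed. This is the textbook definition with the usual max-subtraction trick for numerical stability; it is not necessarily how the follow-up post will implement it, and the names `softmax`, `scores`, and `probabilities` are mine.

```python
import numpy as np

def softmax(scores):
    """Exponentiate and normalize scores so they are positive and sum to 1."""
    shifted = scores - np.max(scores)     # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Rounded scores for the w(t-2) context of "tape", copied from the output above.
scores = np.array([0.42451664, 0.32344971, 0.40759145,
                   0.31176029, 0.41795589, 0.35267831])
probabilities = softmax(scores)

print(probabilities)
print(probabilities.sum())
```

The ordering of the words is unchanged by softmax; what changes is that the values can now be read as a probability distribution over the vocabulary.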

### Data Exploration with Python, Part 1

Exploratory data analysis (EDA) is an important pillar of data science, a critical step required to complete every project regardless of the domain or the type of data you are working with. It is exploratory analysis that gives us a sense of what additional work should be performed to quantify and extract insights from our data. It also informs us as to what the end product of our analytical process should be. Yet, in the decade that I've been working in analytics and data science, I've often seen people grasping at straws when it comes to exploring their data and trying to find insights.

Having witnessed the lack of structure in conventional approaches, I decided to document my own process in an attempt to come up with a framework for data exploration. I wanted the resulting framework to provide a more structured path to insight discovery: one that allows us to view insight discovery as a problem, break that problem down into manageable components, and then start working toward a solution. I've been speaking at conferences over the last few months about this framework. It has been very well-received, so I wanted to share it with you in the form of this blog post series.

## Introducing the Framework

The framework I came up with, pictured below, consists of a Prep Phase and an Explore Phase. Each phase has several steps in it that we will walk through together as we progress through this series.

The series will have four parts. The first two posts will cover the Prep Phase, and the second two posts will cover the Explore Phase.

• Part 1 will introduce the framework, the example data set to which we will be applying it, and the first two stages of the Prep Phase (Identify and Review) which serve to prepare you, the human, for exploring your data.
• Part 2 will cover the third stage of the Prep Phase (Create), which will shift the focus from preparing the analyst to preparing the data itself to be explored.
• Part 3 will begin the Explore Phase, where we will demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data.
• Part 4 will continue our exploration of relationships, leveraging graph analysis to examine connections between the entities in our data.

There is an important point about the framework's two phases that I would like to stress here. While you can technically complete the Explore Phase without the Prep Phase, it is the Prep Phase that is going to allow you to explore your data both better and faster. In my experience, I have found that the time it takes to complete the Prep Phase is more than compensated for by the time saved not fumbling around in the Explore Phase. We are professionals, this is part of our craft, and proper preparation is important to doing our craft well.

I hope you enjoy the series and are able to walk away from it with an intuitive, repeatable framework for thinking about, analyzing, visualizing, and discovering insights from your data. Before we jump into looking at our example data set and applying the framework to it, however, I would like to lay some foundation, provide some context, and share with you how I think about data.

At the most basic level, we can think of data as just encoded information. But the critical thing about this is that information can be found everywhere. Everything you know, everything you can think of, everything you encounter every moment of every day is information that has the potential to be captured, encoded, and therefore turned into data. The ubiquity of these potential data sources turns our world into a web of complexity. We can think of each data set we encounter as a slice of this complexity that has been put together in an attempt to communicate something about the world to a human or a machine.

As humans, we have an inherent ability to deal with this complexity and the vast amounts of information we are constantly receiving. We do this by organizing things and putting them in order. We create categories, hierarchical classifications, taxonomies, tags, and other systems to organize our information. These constructs help provide order and structure to the way we view the world, and they allow us to look at something, recognize the similarities it has to a group of other things, and quickly make decisions about it.

The fact that we have this ability to create order out of complexity is wonderful, and we use it all the time without ever thinking about it. Unfortunately, that includes when we're analyzing data, and that often makes our processes not reproducible, repeatable, or reliable. So I wanted my data exploration framework to explicitly take advantage of this ability and help people make better use of it in their workflows.

## Our Example Data Set

The data set we will be applying this framework to throughout this series is the Environmental Protection Agency's Vehicle Fuel Economy data set. Here is what the data looks like after some light clean-up.

To get the data looking like this, we first need to download the data to disk.

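    import os
    import zipfile
    import requests

    # Wrapper reconstructed around the steps below; the function name and
    # the default path are assumptions.
    def download_data(url, name, path='data'):
        if not os.path.exists(path):
            os.mkdir(path)

        # Download the archive and write it to disk.
        response = requests.get(url)
        with open(os.path.join(path, name), 'wb') as f:
            f.write(response.content)

        # Extract the archive we just saved.
        z = zipfile.ZipFile(os.path.join(path, name))
        z.extractall(path)

    VEHICLES = 'http://bit.ly/ddl-cars'

    download_data(VEHICLES, 'vehicles.zip')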



From there, we are going to load it into a pandas data frame.

    import pandas as pd

    path = 'data'
    vehicles = pd.read_csv(os.path.join(path, 'vehicles.csv'))  # file name assumed



And finally, we are going to clean it up by dropping columns we don't need, removing vehicles that are coming out in the future, removing any duplicate records and records with missing values, and then sorting the data by make, model, and year.

    select_columns = ['make', 'model', 'year', 'displ', 'cylinders', 'trany', 'drive', 'VClass', 'fuelType', 'barrels08', 'city08', 'highway08', 'comb08', 'co2TailpipeGpm', 'fuelCost08']

    # Keep the selected columns for model years up to 2016, then drop duplicates and incomplete rows.
    vehicles = vehicles.loc[vehicles.year <= 2016, select_columns].drop_duplicates().dropna()

    vehicles = vehicles.sort_values(['make', 'model', 'year'])


## Identify Stage

Now that we have a clean data set, let's jump into the framework, beginning with the Prep Phase. The first thing we're going to do is identify the types of information contained in our data set, which will help us get to know our data a bit better and prepare us to think about the data in different ways. After that, we will identify the entities in our data set so that we are aware of the different levels to which we can aggregate up or drill down.

### Types of Information

There are a few distinct types of information that jump out at us just from taking a quick look at the data set.

• Vehicle attributes information
• Vehicle manufacturer information
• Engine information
• Fuel information (such as fuel efficiency, fuel type, and fuel cost)
• Transmission information
• Drive axle information

There are also some other types of information in our data that may not be as obvious. Since we have the year the vehicle was manufactured, we can observe changes in the data over time. We also have relationship information in the data, both between fields and between the entities. And since we have both a time variable as well as information about relationships, we can learn how those relationships have changed over time.

### Entities in the Data

The next step in the Prep Phase is to identify the entities in our data. Now, what exactly do I mean by entities? When I refer to entities, I'm referring to the individual, analyzable units in a data set. To conduct any type of analysis, you need to be able to distinguish one entity from another and identify differences between them. Entities are also usually part of some hierarchical structure where they can be aggregated into one or more systems, or higher-level entities, to which they belong. Now that we have defined what an entity is, let's take a look at the different levels of them that are present in our data set.

At Level 1, the most granular level in the data, you can see the year and specific model of vehicle. The next level we can aggregate up to from there is year and model type, which is slightly less granular. From there, we have a few different directions we can pursue: year and vehicle class, year and make, or we can remove year and only keep model type. Finally, at Level 4, we can further aggregate the data to just the vehicle classes, the years, or the makes.

To illustrate even further, here are some actual examples of entities in our data set.

At Level 1, which is the year and the specific model, we have a 2016 Ford Mustang with a 2.3 liter four-cylinder engine, automatic transmission, and rear-wheel drive. At Level 2, we can roll things up and look at all 2016 Ford Mustangs as one entity that we're analyzing. Then at Level 3, we can either make our entities all 2016 Subcompact Cars, all 2016 Fords, or all Ford Mustangs regardless of the year they were manufactured. From there, we can continue going up the hierarchy.
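The levels described above map naturally onto pandas `groupby` keys. Here is a minimal sketch, assuming a small hand-made table: the rows and fuel economy values are invented for illustration and are not taken from the EPA data, although the column names mirror our cleaned data set.

```python
import pandas as pd

# Toy rows, invented for illustration; column names mirror the cleaned data set.
toy = pd.DataFrame({
    'make':   ['Ford', 'Ford', 'Ford', 'Honda'],
    'model':  ['Mustang', 'Mustang', 'Focus', 'Civic'],
    'year':   [2016, 2016, 2016, 2015],
    'VClass': ['Subcompact Cars', 'Subcompact Cars', 'Compact Cars', 'Compact Cars'],
    'comb08': [25, 19, 30, 33],
})

# Level 2: year and model type (all 2016 Ford Mustangs form one entity).
level_2 = toy.groupby(['year', 'make', 'model'])['comb08'].mean()

# Level 3: three alternative roll-ups from Level 2.
by_year_class = toy.groupby(['year', 'VClass'])['comb08'].mean()
by_year_make = toy.groupby(['year', 'make'])['comb08'].mean()
by_model = toy.groupby(['make', 'model'])['comb08'].mean()

# Level 4: a single dimension, such as make.
by_make = toy.groupby('make')['comb08'].mean()
```

Each result has one row per entity at that level, so moving up or down the hierarchy is just a matter of changing the list of grouping columns.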

Again, doing this is important, and it will help you think about all the things you can do to the data and all the different ways you can look at it later on. I see a lot of people that are new to data science who don't do this. They don't think about their data this way, and because of that, they end up missing valuable insights that they would have otherwise discovered. I hope that these examples help make it easier to think about data this way.

## Review Stage

The next step in the Prep Phase is to review some transformation and visualization methods. Doing this will ensure that we are aware of the tools we have in our analytic arsenal, what they should be used for, and when to utilize each one.

### Transformation Methods

The first methods we will cover are the transformation methods. Let's take a look at some of my favorite ways to transform data.

• Filtering
• Aggregation/Disaggregation
• Pivoting
• Graph Transformation

The first method I have listed here is Filtering, which is making the data set smaller by looking at either fewer rows, fewer columns, or both. The next method on the list is Aggregation/Disaggregation. This is the process of changing the levels at which you are analyzing the data, getting either more or less granular. Then we have Pivoting, which is the process of aggregating by multiple variables along two axes - the rows and the columns. Finally, we have Graph Transformation, which is the process of linking your entities based on shared attributes and examining how they relate to one another.
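To make these four methods concrete, here is a minimal sketch in pandas. The table is invented for illustration, and the graph step is only one simple way to link entities (makes that share a vehicle class), not the approach used later in the series.

```python
from itertools import combinations

import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    'make':   ['Ford', 'Ford', 'Honda', 'Honda'],
    'year':   [2015, 2016, 2015, 2016],
    'VClass': ['Compact Cars', 'Subcompact Cars', 'Compact Cars', 'Compact Cars'],
    'comb08': [28, 22, 33, 35],
})

# Filtering: fewer rows (2016 only) and fewer columns.
filtered = toy.loc[toy['year'] == 2016, ['make', 'comb08']]

# Aggregation: roll up to one row per make.
aggregated = toy.groupby('make')['comb08'].mean()

# Pivoting: makes on the rows, years on the columns.
pivoted = toy.pivot_table(index='make', columns='year', values='comb08', aggfunc='mean')

# Graph transformation: link two makes when they share a vehicle class.
edges = set()
for _, group in toy.groupby('VClass'):
    makes = sorted(group['make'].unique())
    edges.update(combinations(makes, 2))
```

Here `edges` is a plain set of make pairs; a graph library could take those pairs as input, but the linking idea is the same.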

By transforming the data, you are ultimately altering its structure, which allows you to look at it from several perspectives. And just like looking at anything else from different perspectives, you will learn something new from each way that you view it. The remarkable thing about this is that the number of ways you can transform the data is limited only by your creativity and your imagination. This, for me, is one of the most exciting things about working with data - all the things you can do to it and all the creative ways that you can transform it.

### Visualization Methods

In addition to transforming the data, I also like to go a step further and visualize it, as sometimes the transformations you perform can be difficult to interpret. Converting your transformations to visualizations allows you to bring the human visual cortex into your analytical process, and I often find that this helps me find more insights faster, since the visual component of it makes the insights jump right out at me.

Because of this, transformation and visualization go hand-in-hand. Since there are a variety of ways to transform data, there are also several ways you can visualize it. I like to keep things relatively simple, so here are some of the visualization methods I use most often.

• Bar charts
• Multi-line Graphs
• Scatter plots/matrices
• Heatmaps
• Network Visualizations

The first visualization method on the list is Bar charts, which help you intuitively view aggregations by comparing the size or magnitude of higher-level entities in the data. Bar charts are simple, but they can be very useful, which is why they are one of the most popular types of visualization methods. Next, we have Multi-line Graphs, which are usually used to show changes over time or some other measure, where each line typically represents a higher-level entity whose behavior you are comparing.

The third method on the list is a combination of Scatter Plots and Scatter Matrices. Using scatter plots, you can view relationships and correlations between two numeric variables in your data set at a time. Scatter matrices are simply a matrix of scatter plots, so they allow you to view the relationships and correlations between all your numeric variables in a single visualization.

The fourth visualization method listed is Heatmaps, which allow you to view the concentration, magnitude, or other calculated value of entities that fall into different combinations of categories in your data. Last, but certainly not least, we have Network Visualizations, which bring graph transformations to life and let you visualize relationships between the entities in your data via a collection of visual nodes and edges.

We will cover all of these visualization methods in more depth and show examples of them in the other posts in this series.

## Conclusion

In this post, we have developed a way of thinking about data, both in general and for our example data set, which will help us explore our data in a creative but structured way. By this point, you should have foundational knowledge about your data set, as well as some transformation and visualization methods available to you, so that you can quickly deploy them when necessary.

The goal of this post was to prepare you, the analyst or data scientist, for exploring your data. In the next post, we will prepare the data itself to be explored. We will move to the last step of the Prep Phase and come up with ways to create additional categories that will help us explore our data from various perspectives. We will do this in several ways, some of which you may have seen or used before and some of which you may not have realized were possible. So make sure to stay tuned!

### Book Memo: “Algorithmic Mathematics”

 Algorithms play an increasingly important role in nearly all fields of mathematics. This book allows readers to develop basic mathematical abilities, in particular those concerning the design and analysis of algorithms as well as their implementation. It presents not only fundamental algorithms like the sieve of Eratosthenes, the Euclidean algorithm, sorting algorithms, algorithms on graphs, and Gaussian elimination, but also discusses elementary data structures, basic graph theory, and numerical questions. In addition, it provides an introduction to programming and demonstrates in detail how to implement algorithms in C++. This textbook is suitable for students who are new to the subject and covers a basic mathematical lecture course, complementing traditional courses on analysis and linear algebra. Both authors have given this ‘Algorithmic Mathematics’ course at the University of Bonn several times in recent years.

### If you did not already know: “Sparse Linear Method (SLIM)”

This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse LInear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an l1-norm and l2-norm regularized optimization problem. W is demonstrated to produce high-quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.
GitXiv
Sparse Linear Method (SLIM)

### Magister Dixit

“Unlike SAP HANA, Hadoop won’t help you understand your business at ‘the speed of thought.’ But it lets you store and access more voluminous, detailed data at lower cost so you can drill deeper and in different ways into your business data.” SAP ( 2013 )

### Document worth reading: “Rules of Machine Learning: Best Practices for ML Engineering”

This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document. Rules of Machine Learning: Best Practices for ML Engineering

### R Packages worth a look

Forward Search using asymptotic theory (ForwardSearch)
Forward Search analysis of time series regressions. Implements the asymptotic theory developed in Johansen and Nielsen (2013, 2014).

Functional Linear Mixed Models for Irregularly or Sparsely Sampled Data (sparseFLMM)
Estimation of functional linear mixed models for irregularly or sparsely sampled data based on functional principal component analysis.

Programming with Big Data — Remote Procedure Call (pbdRPC)
A very light yet secure implementation of remote procedure calls with a unified interface via ssh (OpenSSH) or plink/plink.exe (PuTTY).

Interactive Biplots in R (BiplotGUI)
Provides a GUI with which users can construct and interact with biplots.

Empirical Bayes Ranking (EBrank)
Empirical Bayes ranking applicable to parallel-estimation settings where the estimated parameters are asymptotically unbiased and normal, with known standard errors. A mixture normal prior for each parameter is estimated using Empirical Bayes methods, subsequentially ranks for each parameter are simulated from the resulting joint posterior over all parameters (The marginal posterior densities for each parameter are assumed independent). Finally, experiments are ordered by expected posterior rank, although computations minimizing other plausible rank-loss functions are also given.

General-to-Specific (GETS) Modelling and Indicator Saturation Methods (gets)
Automated multi-path General-to-Specific (GETS) modelling of the mean and variance of a regression, and indicator saturation methods for detecting structural breaks in the mean. The mean can be specified as an autoregressive model with covariates (an ‘AR-X’ model), and the variance can be specified as a log-variance model with covariates (a ‘log-ARCH-X’ model). The four main functions of the package are arx, getsm, getsv and isat. The first function, arx, estimates an AR-X model with log-ARCH-X errors. The second function, getsm, undertakes GETS model selection of the mean specification of an arx object. The third function, getsv, undertakes GETS model selection of the log-variance specification of an arx object. The fourth function, isat, undertakes GETS model selection of an indicator saturated mean specification.

Multiple Hot-deck Imputation (hot.deck)
Performs multiple hot-deck imputation of categorical and continuous variables in a data frame.

Hot Deck Imputation Methods for Missing Data (HotDeckImputation)
This package provides hot deck imputation methods to resolve missing data.

### An example that isn't that artificial or intelligent

Editor’s note: This is the second chapter of a book I’m working on called Demystifying Artificial Intelligence. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only two chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this amazing tweet by Twitter user @notajf. Feedback is welcome and encouraged!

“I am so clever that sometimes I don’t understand a single word of what I am saying.” Oscar Wilde

As we have described it, artificial intelligence applications consist of three things:

1. A large collection of data examples.
2. An algorithm for learning a model from that training set.
3. An interface with the world.

In the following chapters we will go into each of these components in much more detail, but let’s start with a couple of very simple examples to make sure that the components of an AI are clear. We will start with a completely artificial example and then move to more complicated examples.

## Building an album

Let’s start with a very simple hypothetical example that can be understood even if you don’t have a technical background. We can also use this example to define some of the terms we will be discussing later in the book.

In our simple example the goal is to make an album of photos for a friend. For example, suppose I want to take the photos in my photobook and find all the ones that include pictures of myself and my son Dex for his grandmother.

If you are anything like the author of this book, then you probably have a very large number of pictures of your family on your phone. So the first step in making the photo album would be to sort through all of my pictures and pick out the ones that should be part of the album.

This is a typical example of the type of thing we might want to train a computer to do in an artificial intelligence application. Each of the components of an AI application is there:

1. The data: all of the pictures on the author’s phone (a big training set!)
2. The algorithm: finding pictures of me and my son Dex
3. The interface: the album to give to Dex’s grandmother.

One way to solve this problem is for me to sort through the pictures one by one and decide whether they should be in the album or not, then assemble them together, and then put them into the album. If I did it like this then I myself would be the AI! That wouldn’t be very artificial though. Imagine we instead wanted to teach a computer to make this album.

But what does it mean to “teach” a computer to do something?

The terms “machine learning” and “artificial intelligence” invoke the idea of teaching computers in the same way that we teach children. This was a deliberate choice to make the analogy - both because in some ways it is appropriate and because it is useful for explaining complicated concepts to people with limited backgrounds. To teach a child to find pictures of the author and his son, you would show her lots of examples of that type of picture and maybe some examples of the author with other kids who were not his son. You’d repeat to the child that the pictures of the author and his son were the kinds you wanted and the others weren’t. Eventually she would retain that information and if you gave her a new picture she could tell you whether it was the right kind or not.

To teach a machine to perform the same kind of recognition you go through a similar process. You “show” the machine many pictures labeled as either the ones you want or not. You repeat this process until the machine “retains” the information and can correctly label a new photo. Getting the machine to “retain” this information is a matter of getting the machine to create a set of step by step instructions it can apply to go from the image to the label that you want.

## The data

The images are what people in the fields of artificial intelligence and machine learning call “raw data” (Leek, n.d.). The categories of pictures (a picture of the author and his son or a picture of something else) are called the “labels” or “outcomes”. If the computer gets to see the labels when it is learning then it is called “supervised learning” (Wikipedia contributors 2016) and when the computer doesn’t get to see the labels it is called “unsupervised learning” (Wikipedia contributors 2017a).

Going back to our analogy with the child, supervised learning would be teaching the child to recognize pictures of the author and his son together. Unsupervised learning would be giving the child a pile of pictures and asking her to sort them into groups. She might sort them by color or subject or location - not necessarily into categories that you care about. But probably one of the categories she would make would be pictures of people - so she would have found some potentially useful information even if it wasn’t exactly what you wanted. One whole field of artificial intelligence is figuring out how to take the information learned in this “unsupervised” setting and use it for supervised tasks - this is sometimes called “transfer learning” (Raina et al. 2007) by people in the field since you are transferring information from one task to another.

Returning to the task of “teaching” a computer to retain information about what kind of pictures you want, we run into a problem - computers don’t know what pictures are! They also don’t know what audio clips, text files, videos, or any other kind of information are. At least not directly. They don’t have eyes, ears, and other senses along with a brain designed to decode the information from these senses.

So what can a computer understand? A good rule of thumb is that a computer works best with numbers. If you want a computer to sort pictures into an album for you, the first thing you need to do is to find a way to turn all of the information you want to “show” the computer into numbers. In the case of sorting pictures into albums - a supervised learning problem - we need to turn the labels and the images into numbers the computer can use.

One way to do that would be for you to do it for the computer. You could take every picture on your phone and label it with a 1 if it was a picture of the author and his son and a 0 if not. Then you would have a set of 1’s and 0’s corresponding to all of the pictures. This takes something the computer can’t understand (the picture) and turns it into something the computer can understand (the label).

While this process would turn the labels into something a computer could understand, it still isn’t something we could teach a computer to do. The computer can’t actually “look” at the image and doesn’t know who the author or his son are. So we need to figure out a way to turn the images into numbers for the computer to use to generate those labels directly.

This is a little more complicated but you could still do it for the computer. Let’s suppose that the author and his son always wear matching blue shirts when they spend time together. Then you could go through and look at each image and decide what fraction of the image is blue. So each picture would get a number ranging from zero to one like 0.30 if the picture was 30% blue and 0.53 if it was 53% blue.

The fraction of the picture that is blue is called a “feature” and the process of creating that feature is called “feature engineering” (Wikipedia contributors 2017b). Until very recently feature engineering of text, audio, or video files was best performed by an expert human. In later chapters we will discuss how one of the most exciting parts about AI applications is that it is now possible to have computers perform feature engineering for you.
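To make the “fraction of blue” feature concrete, here is a minimal sketch in Python. It assumes an image is simply a list of (red, green, blue) pixel values, and it counts a pixel as blue when blue is its strongest channel. Both the function name and that definition of “blue” are assumptions made for this illustration, not a standard.

```python
def fraction_blue(pixels):
    """Return the share of pixels in which blue is the strongest channel."""
    if not pixels:
        return 0.0
    blue = sum(1 for r, g, b in pixels if b > r and b > g)
    return blue / len(pixels)

# A tiny four-pixel "image": two blue pixels, one red, one white.
image = [(10, 20, 200), (0, 0, 255), (220, 30, 30), (255, 255, 255)]
feature = fraction_blue(image)  # two of the four pixels count as blue
```

The single number stored in `feature` is what the computer would “see” in place of the picture.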

## The algorithm

Now that we have converted the images to numbers and the labels to numbers, we can talk about how to “teach” a computer to label the pictures. A good rule of thumb when thinking about algorithms is that a computer can’t “do” anything without being told very explicitly what to do. It needs a step by step set of instructions. The instructions should start with a calculation on the numbers for the image and should end with a prediction of what label to apply to that image. The image (converted to numbers) is the “input” and the label (also a number) is the “output”. You may have heard the phrase:

“Garbage in, garbage out”

What this phrase means is that if the inputs (the images) are bad - say they are all very dark or hard to see - then the output of the algorithm will also be bad: the predictions won’t be very good.

A machine learning “algorithm” can be thought of as a set of instructions with some of the parts left blank - sort of like mad-libs. One example of a really simple algorithm for sorting pictures into the album would be:

1. Calculate the fraction of blue in the image.
2. If the fraction of blue is above X label it 1
3. If the fraction of blue is less than X label it 0
4. Put all of the images labeled 1 in the album

The machine “learns” by using the examples to fill in the blanks in the instructions. In the case of our really simple algorithm we need to figure out what fraction of blue to use (X) for labeling the picture.
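The four steps above can be sketched as a short Python function with the blank, X, left as a parameter. The function names are made up for this illustration, and since the steps do not say what to do when the fraction is exactly X, the sketch assumes such a picture is labeled 0.

```python
def label_image(blue_fraction, x):
    """Label 1 if the blue fraction is above the cutoff x, otherwise 0."""
    return 1 if blue_fraction > x else 0

def build_album(blue_fractions, x):
    """Return the positions of the images that are labeled 1."""
    return [i for i, frac in enumerate(blue_fractions) if label_image(frac, x) == 1]

# Blue fractions for three pictures, invented for illustration.
fractions = [0.05, 0.30, 0.53]
album = build_album(fractions, x=0.25)  # the second and third pictures
```

Nothing in this code is “learned” yet: the instructions are complete except for the value of `x`, which is exactly the blank the examples are used to fill in.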

To figure out a guess for X we need to decide what we want the algorithm to do. If we set X to be too low then all of the images will be labeled with a 1 and put into the album. If we set X to be too high then all of the images will be labeled 0 and none will appear in the album. In between there is some grey area - do we care if we accidentally get some pictures of the ocean or the sky with our algorithm?

But the number of images in the album isn’t even the thing we really care about. What we might care about is making sure that the album is mostly pictures of the author and his son. In the field of AI they usually turn this statement around - we want to make sure the album has a very small fraction of pictures that are not of the author and his son. This fraction - the fraction that are incorrectly placed in the album is called the “loss”. You can think about it like a game where the computer loses a point every time it puts the wrong kind of picture into the album.

Using our loss (how many pictures we incorrectly placed in the album) we can now use the data we have created (the numbers for the labels and the images) to fill in the blanks in our mad-lib algorithm (picking the cutoff on the amount of blue). We have a large number of pictures where we know what fraction of each picture is blue and whether it is a picture of the author and his son or not. We can try each possible X and calculate the fraction of pictures in the album that are incorrectly placed into the album (the loss) and find the X that produces the smallest fraction.
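Here is a minimal sketch of that search in Python. The training examples and the list of candidate cutoffs are invented for illustration, the function names are hypothetical, and the sketch assumes we simply skip any cutoff that leaves the album empty, since the fraction of wrong pictures is undefined in that case.

```python
def album_loss(blue_fractions, labels, x):
    """Fraction of the pictures placed in the album that do not belong there."""
    in_album = [label for frac, label in zip(blue_fractions, labels) if frac > x]
    if not in_album:
        return None  # empty album: the fraction is undefined
    return in_album.count(0) / len(in_album)

def learn_cutoff(blue_fractions, labels, candidates):
    """Return the candidate cutoff with the smallest loss (ties go to the smaller cutoff)."""
    scored = [(album_loss(blue_fractions, labels, x), x) for x in candidates]
    scored = [(loss, x) for loss, x in scored if loss is not None]
    return min(scored)[1]

# Invented training examples: the blue fraction and the hand-assigned label.
fractions = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60]
labels = [0, 0, 0, 1, 1, 1]
best_x = learn_cutoff(fractions, labels, candidates=[0.0, 0.15, 0.30])
```

With these made-up examples, a cutoff of 0.0 puts every picture in the album (half of them wrong), 0.15 lets one wrong picture in out of four, and 0.30 lets none in, so 0.30 is the cutoff the computer “learns”.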

Suppose that the value of X that gives the smallest fraction of wrong pictures in the album is 0.30. Then our “learned” model would be:

1. Calculate the fraction of blue in the image
2. If the fraction of blue is above 0.30 label it 1
3. If the fraction of blue is less than 0.30 label it 0
4. Put all of the images labeled 1 in the album

## The interface

The last part of an AI application is the interface. In this case, the interface would be the way that we share the pictures with Dex’s grandmother. For example we could imagine uploading the pictures to Shutterfly and having the album delivered to Dex’s grandmother.

Putting this all together we could imagine an application using our trained AI. The author uploads his unlabeled photos. The photos are then passed to the computer program which calculates the fraction of the image that is blue, then applies a label according to the algorithm we learned, then takes all the images predicted to be of the author and his son and sends them off to be made into a Shutterfly album mailed to the author’s mother.

If the algorithm was good, then from the perspective of the author the website would look “intelligent”. I just uploaded pictures and it created an album for me with the pictures that I wanted. But the steps in the process were very simple and understandable behind the scenes.

## References

Leek, Jeffrey. n.d. “The Elements of Data Analytic Style.” https://leanpub.com/datastyle.

Raina, Rajat, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. 2007. “Self-Taught Learning: Transfer Learning from Unlabeled Data.” In Proceedings of the 24th International Conference on Machine Learning, 759–66. ICML ’07. New York, NY, USA: ACM.

Wikipedia contributors. 2016. “Supervised Learning.” https://en.wikipedia.org/w/index.php?title=Supervised_learning&oldid=752493505.

———. 2017a. “Unsupervised Learning.” https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&oldid=760556815.

———. 2017b. “Feature Engineering.” https://en.wikipedia.org/w/index.php?title=Feature_engineering&oldid=760758719.

## January 19, 2017

### Microsoft R Server in the News

Since the release of Microsoft R Server 9 last month, there's been quite a bit of news in the tech press about the capabilities it provides for using R in production environments.

Infoworld's article, Microsoft’s R tools bring data science to the masses, takes a look back at Microsoft's vision for R since acquiring Revolution Analytics two years ago, and notes that now "R is everywhere in Microsoft’s ecosystem". The article gives some background on open source R, and describes the benefits of using it within Microsoft R Open, Microsoft R Server and SQL Server 2016 R Services.

ZDNet's article, Microsoft's R Server 9: more predictive analytics, in more places, focuses on some of the major new features including the MicrosoftML package, the new Swagger API for R function deployment, and support for Spark 2.0. It also notes that the integration with SQL Server means that "predictive analytics capabilities are now available ... to an entire generation of application developers".

Computerworld's article, Microsoft pushes R, SQL Server integration, focuses on the operationalization capabilities for integrating R into production workflows, such as the new publishServices function. It also mentions the various problem-specific solutions on GitHub, including the new Marketing Campaign Optimization template.

With SQL Server integration as a key component of the platform, you may also be interested in this blog post from the development team: SQL Server R Services – Why we built it.

### Whats new on arXiv

The goal of this paper is not to introduce a single algorithm or method, but to make theoretical steps towards fully understanding the training dynamics of generative adversarial networks. In order to substantiate our theoretical analysis, we perform targeted experiments to verify our assumptions, illustrate our claims, and quantify the phenomena. This paper is divided into three sections. The first section introduces the problem at hand. The second section is dedicated to studying and proving rigorously the problems including instability and saturation that arise when training generative adversarial networks. The third section examines a practical and theoretically grounded direction towards solving these problems, while introducing new tools to study them.
We consider the linear regression problem under semi-supervised settings wherein the available data typically consists of: (i) a small or moderate sized ‘labeled’ data, and (ii) a much larger sized ‘unlabeled’ data. Such data arises naturally from settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large databases like electronic medical records (EMR). Supervised estimators like the ordinary least squares (OLS) estimator utilize only the labeled data. It is often of interest to investigate if and when the unlabeled data can be exploited to improve estimation of the regression parameter in the adopted linear model. In this paper, we propose a class of ‘Efficient and Adaptive Semi-Supervised Estimators’ (EASE) to improve estimation efficiency. The EASE are two-step estimators adaptive to model mis-specification, leading to improved (optimal in some cases) efficiency under model mis-specification, and equal (optimal) efficiency under a linear model. This adaptive property, often unaddressed in the existing literature, is crucial for advocating ‘safe’ use of the unlabeled data. The construction of EASE primarily involves a flexible ‘semi-non-parametric’ imputation, including a smoothing step that works well even when the number of covariates is not small; and a follow up ‘refitting’ step along with a cross-validation (CV) strategy both of which have useful practical as well as theoretical implications towards addressing two important issues: under-smoothing and over-fitting. We establish asymptotic results including consistency, asymptotic normality and the adaptive properties of EASE. We also provide influence function expansions and a ‘double’ CV strategy for inference. The results are further validated through extensive simulations, followed by application to an EMR study on auto-immunity.
For a social networking service to acquire and retain users, it must find ways to keep them engaged. By accurately gauging their preferences, it is able to serve them with the subset of available content that maximises revenue for the site. Without the constraints of an appropriate regulatory framework, we argue that a sufficiently sophisticated curator algorithm tasked with performing this process may choose to explore curation strategies that are detrimental to users. In particular, we suggest that such an algorithm is capable of learning to manipulate its users, for several qualitative reasons: 1. Access to vast quantities of user data combined with ongoing breakthroughs in the field of machine learning are leading to powerful but uninterpretable strategies for decision making at scale. 2. The availability of an effective feedback mechanism for assessing the short and long term user responses to curation strategies. 3. Techniques from reinforcement learning have allowed machines to learn automated and highly successful strategies at an abstract level, often resulting in non-intuitive yet nonetheless highly appropriate action selection. In this work, we consider the form that these strategies for user manipulation might take and scrutinise the role that regulation should play in the design of such systems.
An agglomerative clustering of random variables is proposed, where clusters of random variables sharing the maximum amount of multivariate mutual information are merged successively to form larger clusters. Compared to the previous info-clustering algorithms, the agglomerative approach allows the computation to stop earlier when clusters of desired size and accuracy are obtained. An efficient algorithm is also derived based on the submodularity of entropy and the duality between the principal sequence of partitions and the principal sequence for submodular functions.
This paper presents an alternative approach to p-values in regression settings. This approach, whose origins can be traced to machine learning, is based on the leave-one-out bootstrap for prediction error. In machine learning this is called the out-of-bag (OOB) error. To obtain the OOB error for a model, one draws a bootstrap sample and fits the model to the in-sample data. The out-of-sample prediction error for the model is obtained by calculating the prediction error for the model using the out-of-sample data. Repeating and averaging yields the OOB error, which represents a robust cross-validated estimate of the accuracy of the underlying model. By a simple modification to the bootstrap data involving ‘noising up’ a variable, the OOB method yields a variable importance (VIMP) index, which directly measures how much a specific variable contributes to the prediction precision of a model. VIMP provides a new scientifically interpretable measure of the effect size of a variable, we call the ‘predictive effect size’, that holds whether the researcher’s model is correct or not, unlike the p-value whose calculation is based on the assumed correctness of the model. We also introduce a marginal VIMP index, also easily calculated, which measures the marginal effect of a variable, or what we call ‘the discovery effect’. Our OOB procedure can be applied to both parametric and nonparametric regression models and requires only that the researcher can repeatedly fit their model to bootstrap and modified bootstrap data. We illustrate this approach on a survival data set involving patients with systolic heart failure and to a simulated survival data set where the model is incorrectly specified to illustrate its robustness to model misspecification.
Artificial Neural Networks (ANN) have been phenomenally successful on various pattern recognition tasks. However, the design of neural networks relies heavily on the experience and intuitions of individual developers. In this article, the author introduces a mathematical structure called MLP algebra on the set of all Multilayer Perceptron Neural Networks (MLP), which can serve as a guiding principle to build MLPs suited to particular data sets, and to build complex MLPs from simpler ones.
Internship assignment is a complicated process for universities since it is necessary to take into account a multiplicity of variables to establish a compromise between companies’ requirements and student competencies acquired during the university training. These variables build up a complex relations map that requires the formulation of an exhaustive and rigorous conceptual scheme. In this research a domain ontological model is presented as support to the student’s decision making for opportunities of University studies level of the University Lumiere Lyon 2 (ULL) education system. The ontology is designed and created using methodological approach offering the possibility of improving the progressive creation, capture and knowledge articulation. In this paper, we draw a balance taking the demands of the companies across the capabilities of the students. This will be done through the establishment of an ontological model of an educational learners’ profile and the internship postings which are written in a free text and using uncontrolled vocabulary. Furthermore, we outline the process of semantic matching which improves the quality of query results.
In this work we present a novel recurrent neural network architecture designed to model systems characterized by multiple characteristic timescales in their dynamics. The proposed network is composed of several recurrent groups of neurons that are trained to separately adapt to each timescale, in order to improve the system identification process. We test our framework on time series prediction tasks and we show some promising, preliminary results achieved on synthetic data. To evaluate the capabilities of our network, we compare the performance with several state-of-the-art recurrent architectures.

### Distilled News

CRISP-DM is the leading approach for managing data mining, predictive analytic and data science projects. CRISP-DM is effective but many analytic projects neglect key elements of the approach.
One of the most fundamental questions for scientists across the globe has been “How do we learn a new skill?”. The desire to understand the answer is obvious: if we can understand this, we can enable the human species to do things we might not have thought of before. Alternatively, we can train machines to do more “human” tasks and create true artificial intelligence. While we don’t have a complete answer to the above question yet, there are a few things which are clear. Irrespective of the skill, we first learn by interacting with the environment. Whether we are learning to drive a car or whether it is an infant learning to walk, the learning is based on interaction with the environment. Learning from interaction is the foundational underlying concept for all theories of learning and intelligence.
This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. To keep receiving these articles, sign up on DSC.
Welcome to fast.ai’s 7 week course, Practical Deep Learning For Coders, Part 1, taught by Jeremy Howard (Kaggle’s #1 competitor 2 years running, and founder of Enlitic). Learn how to build state of the art models without needing graduate-level math—but also without dumbing anything down. Oh and one other thing… it’s totally free!
Given the fact that it’s one of the fundamental packages for scientific computing, NumPy is one of the packages that you must be able to use and know if you want to do data science with Python. It offers a great alternative to Python lists, as NumPy arrays are more compact, allow faster access in reading and writing items, and are more convenient and more efficient overall.

### Webinar: Predictive Analytics: Failure to Launch – Feb 14

Learn how to get started with predictive modeling and overcome strategic and tactical limitations that cause data mining projects to fall short of their potential. Next webinar is Feb 14.

### What makes pattern matching and “implicits” in Scala so powerful?

Learn how to enable functional behavior in Scala code.

In my online class, Scala: Beyond the Basics, I discuss two very important aspects to functional programming in Scala. The first is pattern matching, which is the ability to take a pattern of how things are constructed and tear them apart and make decisions based on the parts that make up the item. For example, if I were to create a student registration system for a middle school, I may want to deconstruct the student (not literally) to find out the student’s last name and then put them into the corresponding line (i.e., A-F, G-N, O-Z).

Another aspect I talk about in the course is implicits. Implicits can be a powerful ally in functional programming, but, when used incorrectly, they can turn on you and make debugging your code difficult. With implicits, you can add methods to objects that already exist, establish conversion strategies, and even resurrect parameterized types that have been erased at runtime.

### Search results: Careers in high tech

I was recently interviewed for a piece for ScienceMag about careers in high tech. You can find the original post on ScienceMag.

With big data becoming increasingly popular and relevant,  data scientist jobs are opening up across every industry in virtually every corner of the globe. Unfortunately, the multitude of available positions isn’t making it any easier to actually land a job as a data scientist. Competition is abundant, interviews can be lengthy and arduous, and good ideas aren’t enough to get yourself hired. Michael Li  emphasizes that technical know-how is what hiring managers crave. “No one needs just an ‘ideas’ person. They need someone who can actually get the job done.”

This shouldn’t discourage anyone from pursuing a career in data science because it can be both rewarding and profitable. If you’re looking to brush up your skills and jump start your career, consider applying for our free data science fellowship with offerings now in San Francisco, New York, Washington DC, Boston, and Seattle. Learn more and apply on our website.

### Support Vector Machine Classifier Implementation in R with caret package

Support Vector Machine Implementation in R Programming Language

In the introduction to support vector machine classifier article, we learned about the key aspects as well as the mathematical foundation behind the SVM classifier. In this article, we are going to build a Support Vector Machine classifier using the R programming language. To build the SVM classifier, we are going to use the R machine learning package caret.

Since we discussed the core concepts behind the SVM algorithm in our previous post, the natural next step is to implement the concepts we have learned. If you don’t have a basic understanding of the SVM algorithm, we suggest reading our introduction to support vector machines article first.

## Svm Classifier implementation in R

For the SVM classifier implementation in R using the caret package, we are going to examine a tidy Heart Disease dataset. Our goal is to predict whether or not a patient has heart disease.

To work on big datasets, we can directly use existing machine learning packages. The R developer community has built some great packages to make our work easier. The beauty of these packages is that they are well optimized and already handle most error cases, so we just need to call the functions that implement the algorithms with the right parameters.

For machine learning, caret is a nice package with proper documentation. For implementing a support vector machine, we can use caret, e1071, or other packages.

The principle behind an SVM classifier (Support Vector Machine) algorithm is to build a hyperplane separating data for different classes. This hyperplane building procedure varies and is the main task of an SVM classifier. The main focus while drawing the hyperplane is on maximizing the distance from hyperplane to the nearest data point of either class. These nearest data points are known as Support Vectors.
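To make the idea of "distance from the hyperplane to the nearest data point" concrete, here is a minimal base R sketch. The toy points and the hyperplane (`w`, `b`) are made up for illustration and chosen by hand; nothing here is fitted by an SVM solver.

```r
# Toy 2-D data: two linearly separable classes (values are made up).
x <- rbind(c(1, 1), c(2, 1), c(1, 2),    # class -1
           c(4, 4), c(5, 4), c(4, 5))    # class +1
y <- c(-1, -1, -1, 1, 1, 1)

# A candidate hyperplane w . x + b = 0 (hand-picked, not fitted).
w <- c(1, 1)
b <- -5.5

# Signed distance of each point to the hyperplane.
signed_dist <- as.vector(x %*% w + b) / sqrt(sum(w^2))

# A point is on the correct side when its label and signed distance agree.
all_correct <- all(sign(signed_dist) == y)

# The margin is the smallest absolute distance; the points that attain it
# play the role of support vectors for this hyperplane.
margin  <- min(abs(signed_dist))
nearest <- which(abs(abs(signed_dist) - margin) < 1e-12)

print(all_correct)
print(margin)
print(nearest)
```

An SVM solver searches over `w` and `b` to make this margin as large as possible; the sketch only evaluates one candidate.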

## Caret Package Installation

The R machine learning package caret (Classification And REgression Training) holds tons of functions that help to build predictive models. It holds tools for data splitting, pre-processing, feature selection, tuning, and supervised and unsupervised learning algorithms, etc. It is similar to the sklearn library in Python.

For using it, we first need to install it. Open R console and install it by typing:

install.packages("caret")

The caret package gives us direct access to functions for training models with various machine learning algorithms such as KNN, SVM, decision trees, linear regression, etc.

### Heart Disease Recognition Data Set Description

The Heart Disease data set consists of 14 attributes. All the attributes consist of numeric values. The first 13 variables will be used for predicting the 14th variable. The target variable is at index 14.

| Feature Title | Variable Data Type | Feature Categorization |
| --- | --- | --- |
| 1. age | Continuous Variable | 29 – 77 |
| 2. sex | Categorical Variable | 1 = male; 0 = female |
| 3. cp: chest pain type | Categorical Variable | 1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic |
| 4. trestbps: resting blood pressure | Continuous Variable | 94 – 200 |
| 5. chol: serum cholesterol | Continuous Variable | 126 – 564 |
| 6. fbs: fasting blood sugar > 120 mg/dl | Categorical Variable | 1 = true; 0 = false |
| 7. restecg: resting ECG results | Categorical Variable | 0: normal; 1: having ST-T wave abnormality |
| 8. thalach: maximum heart rate achieved | Continuous Variable | 71 – 202 |
| 9. exang: exercise-induced angina | Categorical Variable | 1 = yes; 0 = no |
| 10. oldpeak: ST depression induced by exercise relative to rest | Continuous Variable | 0 – 6.2 |
| 11. slope: slope of the peak exercise ST segment | Continuous Variable | 1 – 3 |
| 12. ca: number of major vessels | Continuous Variable | 0 – 3 |
| 13. thal | Categorical Variable | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| 14. Target Variable | Categorical Variable | 0: Absence of Heart Disease; 1: Presence of Heart Disease |

The above table shows all the details of data.

### Heart Disease Recognition Problem Statement

To model a classifier for predicting whether a patient is suffering from any heart disease or not.

## SVM classifier implementation in R with Caret Package

### R caret Library:

For implementing SVM in R, we only need to load the caret package. As we mentioned above, it helps with many of the tasks in our machine learning work. Just paste the command below into the R console to load the caret package.

library(caret)

#### Data Import

For importing the data and manipulating it, we are going to use data frames. First of all, we need to download the dataset. You can download the dataset from our repository. It’s a CSV file, i.e., a comma-separated values file. After downloading the CSV file, either save it in your current working directory or set your working directory to the folder that contains it.

You can get the path of your current working directory by running the command getwd() in the R console. If you wish to change your working directory, use the command below.

setwd(<PATH of NEW Working Directory>)

heart_df <- read.csv("heart_tidy.csv", sep = ',', header = FALSE)

For importing data into an R data frame, we can use the read.csv() function, passing the file name and a header parameter that says whether the first row of the file is a header. If a header row exists, header should be set to TRUE; otherwise it should be set to FALSE.
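The effect of the header parameter is easy to see on a tiny example. The sketch below reads a made-up three-row CSV from a string (via the `text` argument) instead of a file, so it runs without heart_tidy.csv; the values and column names are illustrative only.

```r
# A tiny CSV held in a string so the example needs no file on disk.
csv_text <- "63,1,145\n67,1,160\n37,0,130"

# header = FALSE: every row is data and columns are named V1, V2, ...
no_header <- read.csv(text = csv_text, header = FALSE)

# header = TRUE: the first row is consumed as column names.
with_header <- read.csv(text = paste0("age,sex,trestbps\n", csv_text),
                        header = TRUE)

print(names(no_header))
print(dim(no_header))
print(names(with_header))
```

This is why the columns of heart_df show up as V1 to V14: the file has no header row, so R generates the names.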

For checking the structure of data frame we can call the function str() over heart_df:

> str(heart_df)
'data.frame':	300 obs. of  14 variables:
$ V1 : int  63 67 67 37 41 56 62 57 63 53 ...
$ V2 : int  1 1 1 1 0 1 0 0 1 1 ...
$ V3 : int  1 4 4 3 2 2 4 4 4 4 ...
$ V4 : int  145 160 120 130 130 120 140 120 130 140 ...
$ V5 : int  233 286 229 250 204 236 268 354 254 203 ...
$ V6 : int  1 0 0 0 0 0 0 0 0 1 ...
$ V7 : int  2 2 2 0 2 0 2 0 2 2 ...
$ V8 : int  150 108 129 187 172 178 160 163 147 155 ...
$ V9 : int  0 1 1 0 0 0 0 1 0 1 ...
$ V10: num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ V11: int  3 2 2 3 1 1 3 1 2 3 ...
$ V12: int  0 3 2 0 0 0 2 0 1 0 ...
$ V13: int  6 3 7 3 3 3 3 3 7 7 ...
$ V14: int  0 1 1 0 0 0 1 0 1 1 ...

The above output shows us that our dataset consists of 300 observations each with 14 attributes.

To check the first six rows of the dataset, we can use head().

> head(heart_df)
V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
1 63  1  1 145 233  1  2 150  0 2.3   3   0   6   0
2 67  1  4 160 286  0  2 108  1 1.5   2   3   3   1
3 67  1  4 120 229  0  2 129  1 2.6   2   2   7   1
4 37  1  3 130 250  0  0 187  0 3.5   3   0   3   0
5 41  0  2 130 204  0  2 172  0 1.4   1   0   3   0
6 56  1  2 120 236  0  0 178  0 0.8   1   0   3   0

The attributes have different ranges of values, but all of them consist of numeric data.

#### Data Slicing

Data slicing is the step that splits the data into a training set and a test set. The training set is used for building our model. The test set should be kept out of model building entirely. This applies to standardization too: the centering and scaling parameters should be estimated from the training set only and then applied to the test set.

set.seed(3033)
intrain <- createDataPartition(y = heart_df$V14, p= 0.7, list = FALSE)
training <- heart_df[intrain,]
testing <- heart_df[-intrain,]

The set.seed() function is used to make our work reproducible. We want our readers to learn the concepts by running these snippets, and the data is split randomly during partitioning. If you pass the same value to set.seed(), you will get the same split, and therefore the same results, as we do.

The caret package provides the createDataPartition() function for partitioning our data into training and test sets. We are passing 3 parameters. The “y” parameter takes the variable according to which the data needs to be partitioned. In our case, the target variable is V14, so we are passing heart_df$V14 (the heart data frame’s V14 column).

The “p” parameter holds a decimal value in the range 0-1. It gives the proportion of the data that goes into the training set. We are using p=0.7, which means that the data is split in a 70:30 ratio. The “list” parameter controls whether the result is returned as a list or as a matrix. We are passing FALSE so that a list is not returned. The createDataPartition() function returns a matrix “intrain” holding the row indices of the training records.

By passing values of intrain, we are splitting training data and testing data.
The line training <- heart_df[intrain,] is for putting the data from data frame to training data. Remaining data is saved in the testing data frame, testing <- heart_df[-intrain,].
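To show what a class-aware 70:30 split does, here is a minimal base R sketch. It is not caret's implementation of createDataPartition(); it only illustrates the idea of sampling row indices within each class. The data frame `toy_df` is a hypothetical stand-in for heart_df with the same number of rows and the same share of positive cases.

```r
set.seed(3033)

# Hypothetical stand-in for heart_df: 300 rows with a binary target in V14.
toy_df <- data.frame(V1 = rnorm(300),
                     V14 = rep(c(0, 1), times = c(162, 138)))

# Sample 70% of the row indices within each class so both classes keep
# roughly the same proportion in the training set.
rows_by_class <- split(seq_len(nrow(toy_df)), toy_df$V14)
train_idx <- sort(unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = ceiling(0.7 * length(idx)))
}), use.names = FALSE))

# Positive indices select the training rows, negative indices drop them.
toy_train <- toy_df[train_idx, ]
toy_test  <- toy_df[-train_idx, ]

print(dim(toy_train))
print(dim(toy_test))
print(prop.table(table(toy_train$V14)))
```

The negative index `-train_idx` is the same trick used in `heart_df[-intrain,]`: it keeps every row that was not selected for training.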

For checking the dimensions of our training data frame and testing data frame, we can use dim().

> dim(training); dim(testing);
[1] 210  14
[1] 90 14

#### Preprocessing & Training

Preprocessing is all about correcting the problems in data before building a machine learning model using that data. Problems can be of many types like missing values, attributes with a different range, etc.

To check whether our data contains missing values or not, we can use anyNA() method. Here, NA means Not Available.

> anyNA(heart_df)
[1] FALSE

Since it’s returning FALSE, it means we don’t have any missing values.

#### Dataset summarized details

For checking the summarized details of our data, we can use summary() method. It will give us a basic idea about our dataset’s attributes range.

> summary(heart_df)
V1              V2             V3              V4              V5
Min.   :29.00   Min.   :0.00   Min.   :1.000   Min.   : 94.0   Min.   :126.0
1st Qu.:48.00   1st Qu.:0.00   1st Qu.:3.000   1st Qu.:120.0   1st Qu.:211.0
Median :56.00   Median :1.00   Median :3.000   Median :130.0   Median :241.5
Mean   :54.48   Mean   :0.68   Mean   :3.153   Mean   :131.6   Mean   :246.9
3rd Qu.:61.00   3rd Qu.:1.00   3rd Qu.:4.000   3rd Qu.:140.0   3rd Qu.:275.2
Max.   :77.00   Max.   :1.00   Max.   :4.000   Max.   :200.0   Max.   :564.0
V6               V7               V8              V9
Min.   :0.0000   Min.   :0.0000   Min.   : 71.0   Min.   :0.0000
1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.8   1st Qu.:0.0000
Median :0.0000   Median :0.5000   Median :153.0   Median :0.0000
Mean   :0.1467   Mean   :0.9867   Mean   :149.7   Mean   :0.3267
3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0   3rd Qu.:1.0000
Max.   :1.0000   Max.   :2.0000   Max.   :202.0   Max.   :1.0000
V10            V11             V12            V13             V14
Min.   :0.00   Min.   :1.000   Min.   :0.00   Min.   :3.000   Min.   :0.00
1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.00   1st Qu.:3.000   1st Qu.:0.00
Median :0.80   Median :2.000   Median :0.00   Median :3.000   Median :0.00
Mean   :1.05   Mean   :1.603   Mean   :0.67   Mean   :4.727   Mean   :0.46
3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.00   3rd Qu.:7.000   3rd Qu.:1.00
Max.   :6.20   Max.   :3.000   Max.   :3.00   Max.   :7.000   Max.   :1.00

The summary statistics above show that the attributes have different ranges, so we need to standardize our data. We can standardize data using caret’s preProcess() method.

Our target variable takes two values, 0 and 1. It should be a categorical variable, so we convert it to a factor.

training[["V14"]] = factor(training[["V14"]])

The above line of code will convert training data frame’s “V14” column to factor variable.
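The same conversion can be applied to the test labels. As far as we know, recent versions of caret expect both arguments of confusionMatrix() to be factors with the same levels, so converting both columns with explicit levels is a safe habit. The sketch below uses short made-up vectors in place of training$V14 and testing$V14.

```r
# Hypothetical 0/1 targets standing in for training$V14 and testing$V14.
train_target <- c(0, 1, 1, 0, 1)
test_target  <- c(1, 0, 0)

# Convert both with the same explicit levels so the level order cannot
# differ between the two vectors.
train_target <- factor(train_target, levels = c(0, 1))
test_target  <- factor(test_target,  levels = c(0, 1))

print(levels(train_target))
print(identical(levels(train_target), levels(test_target)))
print(table(train_target))
```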

### Training the SVM model

The caret package provides the train() method for fitting many different algorithms through one interface. We just need to pass different parameter values for different algorithms. Before calling train(), we first use the trainControl() method. It controls the computational nuances of the train() method.

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3233)

svm_Linear <- train(V14 ~., data = training, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)

We are setting 3 parameters of the trainControl() method. The “method” parameter specifies the resampling method. We can set “method” to many values, such as “boot”, “boot632”, “cv”, “repeatedcv”, “LOOCV”, “LGOCV”, etc. For this tutorial, let’s use repeatedcv, i.e., repeated cross-validation.

For repeated cross-validation, the “number” parameter holds the number of folds, and the “repeats” parameter holds the number of complete sets of folds to compute. We are setting number = 10 and repeats = 3. The trainControl() method returns a list, which we are going to pass to our train() method.
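A minimal base R sketch of what number = 10 and repeats = 3 imply for our 210 training rows is shown below. It is an illustration only: it shuffles rows without regard to class, and the resample names are made up, so it is not a reproduction of caret's internal fold construction.

```r
set.seed(3233)

n_rows  <- 210  # size of the training set in this tutorial
k_folds <- 10   # "number" in trainControl()
repeats <- 3    # "repeats" in trainControl()

# For each repeat, shuffle the rows and deal them into k folds; each fold is
# held out once while the remaining rows are used for fitting.
resamples <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(k_folds), length.out = n_rows))
  for (k in seq_len(k_folds)) {
    resamples[[sprintf("Fold%02d.Rep%d", k, r)]] <- which(fold_id != k)
  }
}

print(length(resamples))          # 30 resamples in total
print(unique(lengths(resamples))) # 189 rows used for fitting in each one
```

Each model is therefore fitted 30 times on 189 rows and evaluated on the 21 rows that were held out.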

Before training our SVM classifier, we call set.seed() so that the resampling is reproducible.

For training the SVM classifier, the train() method should be passed the “method” parameter as “svmLinear”. We are passing our target variable V14. The “V14~.” denotes a formula for using all attributes in our classifier and V14 as the target variable. The “trControl” parameter should be passed the result of our trainControl() method. The “preProcess” parameter is for preprocessing our training data.

As discussed earlier, preprocessing is a mandatory task for our data. We are passing 2 values in our “preProcess” parameter, “center” and “scale”. These two center and scale the data. After preprocessing, each attribute of our training data has a mean of approximately 0 and a standard deviation of 1. The “tuneLength” parameter holds an integer value. It sets how many candidate values are tried for each tuning parameter when no explicit grid is supplied.
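Centering and scaling can be written out by hand in a few lines of base R. The sketch below uses made-up values for a single attribute; the point is that the mean and standard deviation are estimated on the training values only and then reused for the test values.

```r
# Hypothetical training and test values for one attribute.
train_x <- c(233, 286, 229, 250, 204)
test_x  <- c(236, 268)

# Estimate the centering and scaling parameters on the training data only.
train_mean <- mean(train_x)
train_sd   <- sd(train_x)

# Apply the same parameters to both sets.
train_scaled <- (train_x - train_mean) / train_sd
test_scaled  <- (test_x  - train_mean) / train_sd

print(mean(train_scaled))
print(sd(train_scaled))
print(test_scaled)
```

The scaled test values will generally not have mean 0 and standard deviation 1, and that is expected, because they are measured against the training distribution.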

#### Trained SVM model result

You can check the result of our train() method. We are saving its results in a svm_Linear variable.

> svm_Linear
Support Vector Machines with Linear Kernel

210 samples
13 predictor
2 classes: '0', '1'

Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ...
Resampling results:

Accuracy  Kappa
0.815873  0.62942

Tuning parameter 'C' was held constant at a value of 1

For the svmLinear method, the cost parameter was held constant, so the model was evaluated only at C = 1.

### Test Set Prediction

Now, our model is trained with C value as 1. We are ready to predict classes for our test set. We can use predict() method.

> test_pred <- predict(svm_Linear, newdata = testing)
> test_pred
[1] 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0
[45] 0 1 0 1 1 1 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0
[89] 1 0
Levels: 0 1

The caret package provides a predict() method for models returned by train(). We are passing 2 arguments. The first is our trained model, and the second, “newdata”, holds our testing data frame. For a classification model, the predict() method returns a factor of predicted classes, which we are saving in the test_pred variable.

#### How accurately is our model working?

Using confusion matrix, we can print statistics of our results. It shows that our model accuracy for test set is 86.67%.

> confusionMatrix(test_pred, testing$V14 )
Confusion Matrix and Statistics

Reference
Prediction  0  1
0 45  5
1  7 33

Accuracy : 0.8667
95% CI : (0.7787, 0.9292)
No Information Rate : 0.5778
P-Value [Acc > NIR] : 2.884e-09

Kappa : 0.7286
Mcnemar's Test P-Value : 0.7728

Sensitivity : 0.8654
Specificity : 0.8684
Pos Pred Value : 0.9000
Neg Pred Value : 0.8250
Prevalence : 0.5778
Detection Rate : 0.5000
Detection Prevalence : 0.5556
Balanced Accuracy : 0.8669

'Positive' Class : 0

By following the above procedure, we can build our svmLinear classifier.

We can also customize the selection of the C value (cost) in the linear classifier. This can be done by supplying values for a grid search. The next code snippet shows how to build and tune an SVM classifier with different values of C. We put some values of C into the “grid” data frame using expand.grid(). The next step is to use this data frame for testing our classifier at those specific C values. It needs to be passed to the train() method through the tuneGrid parameter.

> grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))
> set.seed(3233)
> svm_Linear_Grid <- train(V14 ~., data = training, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = grid,
tuneLength = 10)
> svm_Linear_Grid
Support Vector Machines with Linear Kernel

210 samples
13 predictor
2 classes: '0', '1'

Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ...
Resampling results across tuning parameters:

C     Accuracy   Kappa
0.00        NaN        NaN
0.01  0.8222222  0.6412577
0.05  0.8285714  0.6540706
0.10  0.8190476  0.6349189
0.25  0.8174603  0.6324448
0.50  0.8126984  0.6232932
0.75  0.8142857  0.6262578
1.00  0.8158730  0.6294200
1.25  0.8158730  0.6294200
1.50  0.8158730  0.6294200
1.75  0.8126984  0.6230572
2.00  0.8126984  0.6230572
5.00  0.8126984  0.6230572

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 0.05.
> plot(svm_Linear_Grid)

The plot shows that our classifier gives the best accuracy at C = 0.05. Let’s try to make predictions using this model for our test set.

> test_pred_grid <- predict(svm_Linear_Grid, newdata = testing)
> test_pred_grid
[1] 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0
[45] 0 1 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0
[89] 1 0
Levels: 0 1

Let’s check its accuracy using the confusion matrix.

> confusionMatrix(test_pred_grid, testing$V14 )
Confusion Matrix and Statistics

Reference
Prediction  0  1
0 46  5
1  6 33

Accuracy : 0.8778
95% CI : (0.7918, 0.9374)
No Information Rate : 0.5778
P-Value [Acc > NIR] : 5.854e-10

Kappa : 0.7504
Mcnemar's Test P-Value : 1

Sensitivity : 0.8846
Specificity : 0.8684
Pos Pred Value : 0.9020
Neg Pred Value : 0.8462
Prevalence : 0.5778
Detection Rate : 0.5111
Detection Prevalence : 0.5667
Balanced Accuracy : 0.8765

'Positive' Class : 0

The results of confusion matrix show that this time the accuracy on the test set is 87.78 %.
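The headline statistics in this output can be recomputed by hand from the four cell counts. The base R sketch below copies the counts from the confusion matrix above (rows are predictions, columns are reference labels, and class 0 is the positive class); the variable names are our own.

```r
# Cell counts copied from the confusion matrix above.
cm <- matrix(c(46, 6, 5, 33), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)

# Accuracy: share of predictions on the diagonal.
accuracy <- sum(diag(cm)) / n

# Sensitivity and specificity with class "0" as the positive class.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_stat), 4))
```

Rounded to four decimals, these reproduce the Accuracy, Sensitivity, Specificity, and Kappa lines reported by confusionMatrix().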

### SVM Classifier using Non-Linear Kernel

In this section, we will build a model using a non-linear kernel, the Radial Basis Function (RBF). To use the RBF kernel, we just need to change our train() method’s “method” parameter to “svmRadial”. With the radial kernel, we need to select proper values for both the cost parameter “C” and the “sigma” parameter.
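To get a feel for what “sigma” controls, here is a minimal base R sketch of an RBF kernel. It assumes the parameterization exp(-sigma * ||x - y||^2), which to our knowledge is the one used by the kernlab package that caret's svmRadial method relies on; the function name `rbf_kernel` and the sample points are our own.

```r
# RBF kernel under the assumed parameterization exp(-sigma * ||x - y||^2).
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

a <- c(1, 2)
b <- c(2, 4)   # squared distance to a is 1 + 4 = 5

# Similarity of a point with itself is always 1.
print(rbf_kernel(a, a, sigma = 0.05))

# Larger sigma makes the similarity fall off faster with distance.
sigmas <- c(0.01, 0.05, 0.5)
similarities <- sapply(sigmas, function(s) rbf_kernel(a, b, sigma = s))
print(similarities)
```

A very small sigma makes all points look alike, while a very large sigma makes each point similar only to itself, which is why sigma has to be tuned together with C.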

> set.seed(3233)
> svm_Radial <- train(V14 ~., data = training, method = "svmRadial",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
> svm_Radial
Support Vector Machines with Radial Basis Function Kernel

210 samples
13 predictor
2 classes: '0', '1'

Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ...
Resampling results across tuning parameters:

C       Accuracy   Kappa
0.25  0.8206349  0.6380027
0.50  0.8174603  0.6317534
1.00  0.8111111  0.6194915
2.00  0.7888889  0.5750201
4.00  0.7809524  0.5592617
8.00  0.7714286  0.5414119
16.00  0.7603175  0.5202125
32.00  0.7301587  0.4598166
64.00  0.7158730  0.4305807
128.00  0.6984127  0.3966326

Tuning parameter 'sigma' was held constant at a value of 0.04744793
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.04744793 and C = 0.25.
> plot(svm_Radial)

The output shows that the final value of the sigma parameter is 0.04744793 and the final value of the C parameter is 0.25. Let’s test our model’s accuracy on our test set. For predicting, we will use predict() with the model svm_Radial and newdata = testing.

> test_pred_Radial <- predict(svm_Radial, newdata = testing)
> confusionMatrix(test_pred_Radial, testing$V14 )
Confusion Matrix and Statistics

Reference
Prediction  0  1
0 47  6
1  5 32

Accuracy : 0.8778
95% CI : (0.7918, 0.9374)
No Information Rate : 0.5778
P-Value [Acc > NIR] : 5.854e-10

Kappa : 0.7486
Mcnemar's Test P-Value : 1

Sensitivity : 0.9038
Specificity : 0.8421
Pos Pred Value : 0.8868
Neg Pred Value : 0.8649
Prevalence : 0.5778
Detection Rate : 0.5222
Detection Prevalence : 0.5889
Balanced Accuracy : 0.8730

'Positive' Class : 0

We are getting an accuracy of 87.78%. So, in this case, with C = 0.25 and sigma = 0.04744793, we are getting good results.

Let’s test and tune our classifier with different values of C and sigma. We will use grid search to do this. The grid_radial data frame will hold the values of sigma and C. It is then passed to the train() method’s tuneGrid parameter.

> grid_radial <- expand.grid(sigma = c(0,0.01, 0.02, 0.025, 0.03, 0.04,
0.05, 0.06, 0.07,0.08, 0.09, 0.1, 0.25, 0.5, 0.75,0.9),
C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
1, 1.5, 2,5))
> set.seed(3233)
> svm_Radial_Grid <- train(V14 ~., data = training, method = "svmRadial",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = grid_radial,
tuneLength = 10)
> svm_Radial_Grid
Support Vector Machines with Radial Basis Function Kernel

210 samples
13 predictor
2 classes: '0', '1'

Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ...
Resampling results across tuning parameters:

sigma  C     Accuracy   Kappa
0.000  0.00        NaN          NaN
0.000  0.01  0.5238095  0.000000000
0.000  0.05  0.5238095  0.000000000
0.000  0.10  0.5238095  0.000000000
0.000  0.25  0.5238095  0.000000000
0.000  0.50  0.5238095  0.000000000
0.000  0.75  0.5238095  0.000000000
0.000  1.00  0.5238095  0.000000000
0.000  1.50  0.5238095  0.000000000
0.000  2.00  0.5238095  0.000000000
0.000  5.00  0.5238095  0.000000000
0.010  0.00        NaN          NaN
0.010  0.01  0.5238095  0.000000000
0.010  0.05  0.5238095  0.000000000
0.010  0.10  0.7857143  0.563592267
0.010  0.25  0.8222222  0.640049451
0.010  0.50  0.8222222  0.641110091
0.010  0.75  0.8222222  0.641137925
0.010  1.00  0.8222222  0.641136734
0.010  1.50  0.8206349  0.637911063
0.010  2.00  0.8206349  0.637911063
0.010  5.00  0.8158730  0.628496247
0.020  0.00        NaN          NaN
0.020  0.01  0.5238095  0.000000000
0.020  0.05  0.6984127  0.377124781
0.020  0.10  0.8222222  0.639785597
0.020  0.25  0.8222222  0.640905610
0.020  0.50  0.8222222  0.641137925
0.020  0.75  0.8222222  0.641020628
0.020  1.00  0.8238095  0.644274433
0.020  1.50  0.8174603  0.631778489
0.020  2.00  0.8158730  0.628872638
0.020  5.00  0.7936508  0.584875363
0.025  0.00        NaN          NaN
0.025  0.01  0.5238095  0.000000000
0.025  0.05  0.7523810  0.491454156
0.025  0.10  0.8222222  0.639930758
0.025  0.25  0.8206349  0.637798181
0.025  0.50  0.8206349  0.637882929
0.025  0.75  0.8222222  0.641020628
0.025  1.00  0.8253968  0.647704077
0.025  1.50  0.8142857  0.625619612
0.025  2.00  0.8111111  0.619606847
0.025  5.00  0.7873016  0.571968346
0.030  0.00        NaN          NaN
0.030  0.01  0.5238095  0.000000000
0.030  0.05  0.7682540  0.525466210
0.030  0.10  0.8206349  0.637002421
0.030  0.25  0.8190476  0.634747262
0.030  0.50  0.8222222  0.641020628
0.030  0.75  0.8206349  0.637999463
0.030  1.00  0.8158730  0.628700648
0.030  1.50  0.8142857  0.625854896
0.030  2.00  0.8079365  0.613267389
0.030  5.00  0.7888889  0.575045900
0.040  0.00        NaN          NaN
0.040  0.01  0.5238095  0.000000000
0.040  0.05  0.7825397  0.555882207
0.040  0.10  0.8238095  0.643780156
0.040  0.25  0.8174603  0.631462266
0.040  0.50  0.8190476  0.634861764
0.040  0.75  0.8174603  0.631866717
0.040  1.00  0.8174603  0.632128942
0.040  1.50  0.8063492  0.610017781
0.040  2.00  0.7936508  0.584994041
0.040  5.00  0.7793651  0.556410718
0.050  0.00        NaN          NaN
0.050  0.01  0.5238095  0.000000000
0.050  0.05  0.7746032  0.539572451
0.050  0.10  0.8222222  0.640843896
0.050  0.25  0.8206349  0.638002663
0.050  0.50  0.8174603  0.631665417
0.050  0.75  0.8158730  0.628641594
0.050  1.00  0.8095238  0.616296629
0.050  1.50  0.7936508  0.585141338
0.050  2.00  0.7904762  0.578416665
0.050  5.00  0.7746032  0.546944215
0.060  0.00        NaN          NaN
0.060  0.01  0.5238095  0.000000000
0.060  0.05  0.7539683  0.495857444
0.060  0.10  0.8222222  0.640843896
0.060  0.25  0.8174603  0.631610502
0.060  0.50  0.8111111  0.619201644
0.060  0.75  0.8095238  0.616267035
0.060  1.00  0.8031746  0.603625492
0.060  1.50  0.7873016  0.572031467
0.060  2.00  0.7968254  0.591535559
0.060  5.00  0.7793651  0.556715449
0.070  0.00        NaN          NaN
0.070  0.01  0.5238095  0.000000000
0.070  0.05  0.7396825  0.465403075
0.070  0.10  0.8222222  0.640784261
0.070  0.25  0.8158730  0.628501328
0.070  0.50  0.8079365  0.612927598
0.070  0.75  0.8047619  0.607028961
0.070  1.00  0.7920635  0.581705463
0.070  1.50  0.7904762  0.578630216
0.070  2.00  0.7952381  0.588103221
0.070  5.00  0.7793651  0.557723627
0.080  0.00        NaN          NaN
0.080  0.01  0.5238095  0.000000000
0.080  0.05  0.7142857  0.411697289
0.080  0.10  0.8206349  0.637529265
0.080  0.25  0.8095238  0.616007758
0.080  0.50  0.8079365  0.613071572
0.080  0.75  0.7984127  0.594146715
0.080  1.00  0.7904762  0.578715332
0.080  1.50  0.7984127  0.594647071
0.080  2.00  0.7904762  0.578689209
0.080  5.00  0.7777778  0.554522065
0.090  0.00        NaN          NaN
0.090  0.01  0.5238095  0.000000000
0.090  0.05  0.6634921  0.303570375
0.090  0.10  0.8222222  0.640490820
0.090  0.25  0.8079365  0.613539156
0.090  0.50  0.8047619  0.606555347
0.090  0.75  0.7936508  0.585021621
0.090  1.00  0.7857143  0.569156095
0.090  1.50  0.7952381  0.588371141
0.090  2.00  0.7841270  0.565873887
0.090  5.00  0.7714286  0.541592623
0.100  0.00        NaN          NaN
0.100  0.01  0.5238095  0.000000000
0.100  0.05  0.6126984  0.194097781
0.100  0.10  0.8126984  0.620930145
0.100  0.25  0.8031746  0.604558785
0.100  0.50  0.8031746  0.603653188
0.100  0.75  0.7936508  0.584991621
0.100  1.00  0.7873016  0.572324436
0.100  1.50  0.7888889  0.575436285
0.100  2.00  0.7825397  0.562906611
0.100  5.00  0.7666667  0.531862324
0.250  0.00        NaN          NaN
0.250  0.01  0.5238095  0.000000000
0.250  0.05  0.5238095  0.000000000
0.250  0.10  0.5238095  0.000000000
0.250  0.25  0.7428571  0.475302551
0.250  0.50  0.7666667  0.534771105
0.250  0.75  0.7539683  0.508759153
0.250  1.00  0.7603175  0.520171909
0.250  1.50  0.7444444  0.488760478
0.250  2.00  0.7460317  0.491872751
0.250  5.00  0.7412698  0.482972131
0.500  0.00        NaN          NaN
0.500  0.01  0.5238095  0.000000000
0.500  0.05  0.5238095  0.000000000
0.500  0.10  0.5238095  0.000000000
0.500  0.25  0.5238095  0.000000000
0.500  0.50  0.5682540  0.098329296
0.500  0.75  0.6587302  0.299577029
0.500  1.00  0.7063492  0.414760542
0.500  1.50  0.7000000  0.402294266
0.500  2.00  0.7047619  0.412014316
0.500  5.00  0.7047619  0.412014316
0.750  0.00        NaN          NaN
0.750  0.01  0.5238095  0.000000000
0.750  0.05  0.5238095  0.000000000
0.750  0.10  0.5238095  0.000000000
0.750  0.25  0.5238095  0.000000000
0.750  0.50  0.5269841  0.006951027
0.750  0.75  0.5571429  0.074479136
0.750  1.00  0.6015873  0.179522487
0.750  1.50  0.6158730  0.213036862
0.750  2.00  0.6174603  0.217499695
0.750  5.00  0.6158730  0.214090454
0.900  0.00        NaN          NaN
0.900  0.01  0.5238095  0.000000000
0.900  0.05  0.5238095  0.000000000
0.900  0.10  0.5238095  0.000000000
0.900  0.25  0.5238095  0.000000000
0.900  0.50  0.5238095  0.000000000
0.900  0.75  0.5444444  0.045715223
0.900  1.00  0.5444444  0.055590646
0.900  1.50  0.5698413  0.111517087
0.900  2.00  0.5698413  0.111517087
0.900  5.00  0.5698413  0.111517087

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.025 and C = 1.
> plot(svm_Radial_Grid)

Awesome, we ran our SVM with the RBF kernel. The grid search evaluated every combination and gave us the best values of sigma and C.
It’s telling us that the best values are sigma = 0.025 and C = 1. Let’s check our trained model’s accuracy on the test set.

> test_pred_Radial_Grid <- predict(svm_Radial_Grid, newdata = testing)
> confusionMatrix(test_pred_Radial_Grid, testing$V14 )
Confusion Matrix and Statistics

Reference
Prediction  0  1
0 46  6
1  6 32

Accuracy : 0.8667
95% CI : (0.7787, 0.9292)
No Information Rate : 0.5778
P-Value [Acc > NIR] : 2.884e-09

Kappa : 0.7267
Mcnemar's Test P-Value : 1

Sensitivity : 0.8846
Specificity : 0.8421
Pos Pred Value : 0.8846
Neg Pred Value : 0.8421
Prevalence : 0.5778
Detection Rate : 0.5111
Detection Prevalence : 0.5778
Balanced Accuracy : 0.8634

'Positive' Class : 0

The svm_Radial_Grid classifier gives an accuracy of 86.67% on the test set, slightly below the 87.78% of the untuned Radial model. So, even after tuning, the Radial classifier does not give better results than the Linear classifier. This may be due to overfitting.

I hope you like this post. If you have any questions, feel free to comment below. If you want me to write on a particular topic, tell me in the comments below.


The post Support Vector Machine Classifier Implementation in R with caret package appeared first on Dataaspirant.

### Avoiding Common Mistakes with Time Series Analysis

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

If you work with data, throughout your career you’ll probably have to re-learn it several times. But you often see the principle demonstrated with a graph like this:

One line is something like a stock market index, and the other is an (almost certainly) unrelated time series like “Number of times Jennifer Lawrence is mentioned in the media.” The lines look amusingly similar. There is usually a statement like: “Correlation = 0.86”.  Recall that a correlation coefficient is between +1 (a perfect linear relationship) and -1 (perfectly inversely related), with zero meaning no linear relationship at all.  0.86 is a high value, demonstrating that the statistical relationship of the two time series is strong.

The correlation passes a statistical test. This is a great example of mistaking correlation for causality, right? Well, no, not really: it’s actually a time series problem analyzed poorly, and a mistake that could have been avoided. You never should have seen this correlation in the first place.

The more basic problem is that the author is comparing two trended time series. The rest of this post will explain what that means, why it’s bad, and how you can avoid it fairly simply. If any of your data involves samples taken over time, and you’re exploring relationships between the series, you’ll want to read on.

### Two random series

There are several ways of explaining what’s going wrong. Instead of going into the math right away, let’s look at an intuitive explanation.

To begin with, we’ll create two completely random time series. Each is simply a list of 100 random numbers between -1 and +1, treated as a time series. The first time is 0, then 1, etc., on up to 99. We’ll call one series Y1 (the Dow-Jones average over time) and the other Y2 (the number of Jennifer Lawrence mentions). Here they are graphed:

There is no point staring at these carefully. They are random. The graphs and your intuition should tell you they are unrelated and uncorrelated. But as a test, the correlation (Pearson’s R) between Y1 and Y2 is -0.02, which is very close to zero. There is no significant relationship between them. As a second test, we do a linear regression of Y1 on Y2 to see how well Y2 can predict Y1. We get a Coefficient of Determination (R² value) of 0.08, also extremely low. Given these tests, anyone should conclude there is no relationship between them.

Now let’s tweak the time series by adding a slight rise to each. Specifically, to each series we simply add points from a slightly sloping line from (0,-3) to (99,+3). This is a rise of 6 across a span of 100. The sloping line looks like this:

Now we’ll add each point of the sloping line to the corresponding point of Y1 to get a slightly sloping series like this:

We’ll add the same sloping line to Y2:

Now let’s repeat the same tests on these new series. We get surprising results: the correlation coefficient is 0.96, a very strong, unmistakable correlation. If we regress Y1 on Y2 we get a very strong R² value of 0.92. The probability that this is due to chance is extremely low, about 1.3 × 10⁻⁵⁴. These results would be enough to convince anyone that Y1 and Y2 are very strongly correlated!

What’s going on? The two time series are no more related than before; we simply added a sloping line (what statisticians call trend). One trended time series regressed against another will often reveal a strong, but spurious, relationship.
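The experiment described above can be reproduced with a short sketch. It assumes NumPy and uniform noise on [-1, +1]; the seed and variable names are illustrative choices, and the exact numbers depend on the random draw, so they will not match the figures quoted in the post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 100

# Two independent noise series, each uniform on [-1, +1].
y1 = rng.uniform(-1.0, 1.0, n)
y2 = rng.uniform(-1.0, 1.0, n)

# The sloping line from (0, -3) to (99, +3).
trend = np.linspace(-3.0, 3.0, n)

# Pearson correlation before and after adding the same trend to both series.
r_raw = np.corrcoef(y1, y2)[0, 1]
r_trended = np.corrcoef(y1 + trend, y2 + trend)[0, 1]

print(f"correlation without trend: {r_raw:+.2f}")
print(f"correlation with trend:    {r_trended:+.2f}")
```

The noise is untouched between the two measurements. Only the shared dependence on time changes, and that alone is enough to push the correlation close to 1.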

Put another way, we’ve introduced a mutual dependency. By introducing a trend, we’ve made Y1 dependent on X, and Y2 dependent on X as well. In a time series, X is time. Correlating Y1 and Y2 will uncover their mutual dependence — but the correlation is really just the fact that they’re both dependent on X. In many cases, as with Jennifer Lawrence’s popularity and the stock market index, what you’re really seeing is that they both increased over time in the period you’re looking at. This is sometimes called secular trend.

The amount of trend determines the effect on correlation. In the example above, we needed to add only a little trend (a slope of 6/100) to change the correlation result from insignificant to highly significant. But relative to the changes in the time series itself (-1 to +1), the trend was large.

A trended time series is not, of course, a bad thing. When dealing with a time series, you generally want to know whether it’s increasing or decreasing, exhibits significant periodicities or seasonalities, and so on. But in exploring time-independent relationships between two time series, you really want to know whether variations in one series are correlated with variations in another. Trend muddies these waters and should be removed.

### Dealing with trend

There are many tests for detecting trend. What can you do about trend once you find it?

One approach is to model the trend in each time series and use that model to remove it. So if we expected Y1 had a linear trend, we could do linear regression on it and subtract the line (in other words, replace Y1 with its residuals). Then we’d do that for Y2, then regress them against each other.
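A minimal sketch of this regression-based approach, assuming NumPy and a straight-line trend; `detrend_linear` is a hypothetical helper name, and the test series are synthetic.

```python
import numpy as np

def detrend_linear(y):
    """Return y minus its least-squares straight-line fit (the residuals)."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, deg=1)  # coefficients, highest degree first
    return y - (slope * t + intercept)

# Synthetic data: independent noise plus the same linear trend in both series.
rng = np.random.default_rng(1)
trend = np.linspace(-3.0, 3.0, 100)
y1 = rng.uniform(-1.0, 1.0, 100) + trend
y2 = rng.uniform(-1.0, 1.0, 100) + trend

r_before = np.corrcoef(y1, y2)[0, 1]
r_after = np.corrcoef(detrend_linear(y1), detrend_linear(y2))[0, 1]
print(f"before de-trending: {r_before:+.2f}, after: {r_after:+.2f}")
```

If the real trend is not linear, the same idea applies with a different model in place of the straight line.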

There are alternative, non-parametric methods that do not require modeling. One such method for removing trend is called first differences. With first differences, you subtract from each point the point that came before it:

y'(t) = y(t) – y(t-1)

Another approach is called link relatives. Link relatives are similar, but they divide each point by the point that came before it:

y'(t) = y(t) / y(t-1)
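Both transformations are one-liners on an array. The sketch below assumes NumPy; the function names and the four-point series are made up for illustration.

```python
import numpy as np

def first_differences(y):
    """y'(t) = y(t) - y(t-1); the result is one element shorter than y."""
    y = np.asarray(y, dtype=float)
    return y[1:] - y[:-1]

def link_relatives(y):
    """y'(t) = y(t) / y(t-1); only meaningful when no earlier point is zero."""
    y = np.asarray(y, dtype=float)
    return y[1:] / y[:-1]

series = [100.0, 102.0, 105.0, 103.0]
print(first_differences(series))  # differences: 2, 3, -2
print(link_relatives(series))     # ratios: 102/100, 105/102, 103/105
```

Link relatives are best suited to strictly positive series such as prices or counts, because the division misbehaves when the series crosses or touches zero.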

### More examples

Once you’re aware of this effect, you’ll be surprised how often two trended time series are compared, either informally or statistically. Tyler Vigen created a web page devoted to spurious correlations, with over a dozen different graphs. Each graph shows two time series that have similar shapes but are unrelated (even comically irrelevant). The correlation coefficient is given at the bottom, and it’s usually high.

How many of these relationships survive de-trending? Fortunately, Vigen provides the raw data so we can perform the tests. Some of the correlations drop considerably after de-trending. For example, here is a graph of US Crude Oil Imports from Venezuela vs Consumption of High Fructose Corn Syrup:

The correlation of these series is 0.88. Now here are the time series after first-differences de-trending:

These time series look much less related, and indeed the correlation drops to 0.24.

A blog post from Alex Jones, more tongue-in-cheek, attempts to link his company’s stock price with the number of days he worked at the company. Of course, the number of days worked is simply the time series: 1, 2, 3, 4, etc. It is a steadily rising line — pure trend! Since his company’s stock price decreased over time, of course he found correlation. In fact, every manipulation of the two variables he performed was simply another way of quantifying the trend in company price.

### Final words

I was first introduced to this problem long ago in a job where I was investigating equipment failures as a function of weather. The data I had were taken over six months, winter into summer. The equipment failures rose over this period (that’s why I was investigating). Of course, the temperature rose as well. With two trended time series, I found strong correlation. I thought I was onto something until I started reading more about time series analysis.

Trends occur in many time series. Before exploring relationships between two series, you should attempt to measure and control for trend. But de-trending is not a panacea because not all spurious correlations are caused by trends. Even after de-trending, two time series can be spuriously correlated. There can remain patterns such as seasonality, periodicity, and autocorrelation. Also, you may not want to de-trend naively with a method such as first differences if you expect lagged effects.

Any good book on time series analysis should discuss these issues. My go-to text for statistical time series analysis is Quantitative Forecasting Methods by Farnum and Stanton (PWS-KENT, 1989). Chapter 4 of their book discusses regression over time series, including this issue.

The post Avoiding Common Mistakes with Time Series Analysis appeared first on Silicon Valley Data Science.

### An internal validation leaderboard in Neptune

Internal validation is a useful tool for comparing results of experiments performed by team members in any business or research task. It can also be a valuable complement to the public leaderboards attached to machine learning competitions on platforms like Kaggle.

In this post, we present how to build an internal validation leaderboard using Python scripts and the Neptune environment. As an example of a use case, we will take the well known classification dataset CIFAR-10. We study it using a deep convolutional neural network provided in the TensorFlow tutorial.

Whenever we solve the same problem in many ways, we want to know which way is the best. Therefore we validate, compare and order the solutions. In this way, we naturally create the ranking of our solutions – the leaderboard.

We usually care about the privacy of our work. We want to keep the techniques used and the results of our experiments confidential. Hence, our validation should remain undisclosed as well – it should be internal.

If we keep improving the models and produce new solutions at a fast pace, at some point we are no longer able to manage the internal validation leaderboard manually. Then we need a tool which will do that for us automatically and will present the results to us in a readable form.

In any business or research project you are probably interested in the productivity of team members. You would like to know who submitted a solution and when, what kind of model they used, and how good the solution is.

A good internal leaderboard stores all that information. It also allows you to search for submissions sent by a specific user, made within a given time window, or using a particular model. Finally, you can sort the submissions with respect to the accuracy metric to find the best one.
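To make these requirements concrete, here is a tool-independent sketch of such a leaderboard as an in-memory list. The `Submission` record, the `leaderboard` function and the sample entries are all hypothetical; they are not part of Neptune or of the code accompanying the post.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Submission:
    user: str
    model: str
    submitted_at: datetime
    accuracy: float
    tags: list = field(default_factory=list)

def leaderboard(submissions, user=None, model=None, since=None, until=None):
    """Filter submissions, then rank them by accuracy, best first."""
    rows = [
        s for s in submissions
        if (user is None or s.user == user)
        and (model is None or s.model == model)
        and (since is None or s.submitted_at >= since)
        and (until is None or s.submitted_at <= until)
    ]
    return sorted(rows, key=lambda s: s.accuracy, reverse=True)

# Made-up submissions, for illustration only.
subs = [
    Submission("ann", "cnn", datetime(2017, 1, 10), 0.81),
    Submission("bob", "svm", datetime(2017, 1, 12), 0.64),
    Submission("ann", "cnn", datetime(2017, 1, 15), 0.86),
]
for s in leaderboard(subs, user="ann"):
    print(s.user, s.model, s.submitted_at.date(), s.accuracy)
```

A list like this stops scaling once many people submit many experiments, which is the point at which a dedicated tool becomes worthwhile.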

### Machine learning competitions

The popular machine learning platform, Kaggle, offers a readable public leaderboard for every competition. Each contestant can follow their position in the ranking and try to improve it several times a day.

However, an internal validation would be very useful for every competing team. A good internal leaderboard has many advantages over a public one:

• the results remain exclusive,
• there is no limit on the number of daily submissions,
• metrics other than those chosen by the competition organizers can be evaluated as well,
• the submissions can be tagged, for example to indicate the model used.

Note that in every official competition the ground truth labels for the test data are not provided. Hence, to produce the internal validation we are forced to split the available public training data. One part is used to tune the model, the other is needed to evaluate it internally. This division can be an origin of unexpected problems (e.g., data leaks) so perform it carefully!
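A simple hold-out split can be sketched as follows, assuming NumPy. The function name, the 20% validation fraction and the sample count are illustrative assumptions. A purely random split like this one does not guard against every kind of leak, for example near-duplicate samples landing on both sides.

```python
import numpy as np

def holdout_split(n_samples, validation_fraction=0.2, seed=0):
    """Return disjoint (train_idx, valid_idx) index arrays covering 0..n_samples-1."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_valid = int(round(n_samples * validation_fraction))
    return order[n_valid:], order[:n_valid]

train_idx, valid_idx = holdout_split(50000)

# No index may appear on both sides; otherwise the validation score
# would be computed partly on data the model was trained on.
assert np.intersect1d(train_idx, valid_idx).size == 0
print(len(train_idx), len(valid_idx))
```

Fixing the seed keeps the split identical across experiments, so that all entries on the internal leaderboard are scored on the same validation examples.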

## Why Neptune?

Neptune was designed to manage multiple experiments. Among many features, it supports storing parameters, logs and metric values from various experiment executions. The results are accessible through an aesthetic Web UI.
In Neptune you can:

• gather experiments from various projects in groups,
• add tags to experiments and filter by them,
• sort experiments by users, date of creation, or – most importantly for us – by metric values.

Due to that, Neptune is a handy tool for creating an internal validation leaderboard for your team.

## Let’s do it!

Let’s build an example internal validation leaderboard in Neptune.

### CIFAR-10 dataset

We use the well-known classification dataset CIFAR-10. Every image in this dataset is a member of one of 10 classes, labeled by numbers from 0 to 9. Using the train data we build a model which allows us to predict the labels of images from test data. CIFAR-10 is designed for educational purposes, therefore the ground truth labels for test data are provided.

### Evaluating functions

Let’s fix the notation:

• $$N$$ – number of images we have to classify.
• $$c_i$$ – class to which the $$i$$th image belongs; $$i\in\{0,\ldots,N-1\}$$, $$c_i\in\{0,\ldots,9\}$$.
• $$p_{ij}$$ – estimated probability that the $$i$$th image belongs to the class $$j$$; $$i\in\{0,\ldots,N-1\}$$, $$j\in\{0,\ldots,9\}$$, $$p_{ij}\in[0,1]$$.

We evaluate our submission with two metrics. The first metric is the classification accuracy given by

$\frac 1N\sum_{i=0}^{N-1}\mathbb{1}\Big(\arg\max_j p_{ij}=c_i\Big).$

This is the fraction of labels that are predicted correctly. We would like to maximize it; the optimal value is 1. The second metric is the average cross entropy given by

$-\frac 1N\sum_{i=0}^{N-1}\log p_{ic_i}.$

This formula is simpler than the general cross-entropy formula because the classes are mutually exclusive: each image has exactly one true class, so only the term for that class appears. We would like to minimize it, preferably to 0.
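Both metrics translate directly into NumPy. The sketch below is an illustration of the two formulas, not the code from the post's repository. The function names, the `eps` clipping constant and the tiny three-class example are assumptions made here.

```python
import numpy as np

def accuracy(submission, true_labels):
    """Fraction of rows whose most probable class equals the true label."""
    labels = np.asarray(true_labels).reshape(-1).astype(int)
    return float(np.mean(np.argmax(submission, axis=1) == labels))

def cross_entropy(submission, true_labels, eps=1e-15):
    """Average of -log p_{i, c_i}; eps guards against log(0)."""
    labels = np.asarray(true_labels).reshape(-1).astype(int)
    p_true = np.asarray(submission)[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))

# Tiny example with 3 images and 3 classes (CIFAR-10 has 10 classes).
submission = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.5, 0.3, 0.2]])
true_labels = np.array([[0], [1], [2]])  # one label per row, shape N x 1

print(accuracy(submission, true_labels))       # 2 of the 3 rows are correct
print(cross_entropy(submission, true_labels))  # mean of -log(0.7), -log(0.8), -log(0.2)
```

The third image shows why the two metrics complement each other: it counts as a plain miss for accuracy, while cross entropy also penalizes how little probability (0.2) was assigned to the true class.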

### Implementation details

#### Prerequisites

To run the code we provide you need the following software:

#### Repository

The code we use is based on that available in the TensorFlow convolutional neural networks tutorial. You can download our code from our GitHub repository. It consists of the following files:

| File | Purpose |
| --- | --- |
| main.py | The script to execute. |
| cifar10_submission.py | Computes submission for a CIFAR-10 model. |
| evaluation.py | Contains functions required to create the leaderboard in Neptune. |
| config.yaml | Neptune configuration file. |

#### Description

When you run `main.py`, you first train a neural network using the function `cifar10_train` provided by TensorFlow. We hard-coded the number of training steps. This could be made adjustable at run time with a Neptune action, but for the sake of brevity we skip this topic in the blog post. Thanks to the TensorFlow integration, you can track the tuning of the network in Neptune. Moreover, the parameters of the tuned network are stored in a file manageable by TensorFlow saver objects.

Then the function `cifar10_submission` is called. It restores the parameters of the network from the file created by `cifar10_train`. Next, it forward-propagates the images from the test set through the network to obtain a submission. The submission is stored as a NumPy array `submission` of shape $$N\times 10$$; the $$i$$th row contains the estimated probabilities $$p_{i0},\ldots,p_{i9}$$. The ground truth labels form a NumPy array `true_labels` of shape $$N\times 1$$; the $$i$$th row contains the label $$c_i$$.

Ultimately, for given `submission` and `true_labels` arrays, the function `evaluate_and_send_to_neptune` from the script `evaluation.py` computes metric values and sends them to Neptune.

The file `config.yaml` is a Neptune job configuration file, essential for running Neptune jobs. Please download all the files and place them in the same folder.

### Step by step

We create a validation leaderboard in Neptune in 4 easy steps:

#### 1. Creating a Neptune group

We create the Neptune group where all the submissions will be stored. We do this as follows:

1. Enter the Neptune home screen.
2. Click in the lower left corner, enter the name “CIFAR-10 leaderboard”, click again.
3. Choose “project” “is” and type “CIFAR-10”, click “Apply”.

Our new group appears in the left column. We can edit or delete it by clicking the icon next to the group name.

#### 2. Creating an evaluation module

We created the module `evaluation.py` consisting of 5 functions:

1. `_evaluate_accuracy` and `_evaluate_cross_entropy` compute the respective metrics,
2. `_prepare_neptune` adds tags to the Neptune job (if specified – see Step 3) and creates Neptune channels to send evaluated metrics,
3. `_send_to_neptune` sends metrics to channels,
4. `evaluate_and_send_to_neptune` calls the above functions.

You can easily adapt this script to evaluate and send any other metrics.

#### 3. Sending submissions to Neptune

To place our submissions in the Neptune group, we need to specify `project: CIFAR-10` in the Neptune config file `config.yaml`. The file is three lines long; it also contains the project name and a description.

Now we are ready to send our results to the leaderboard created in Neptune! Assume that all the files are placed in a folder named `leaderboard`. We run the script `main.py` from the parent folder by typing

neptune run leaderboard/main.py --config leaderboard/config.yaml --dump-dir-url leaderboard/dump --paths-to-dump leaderboard

using the Neptune CLI. The script executes for about half an hour on a modern laptop. Training would be significantly faster on a GPU.

There are only 5 lines related to Neptune in the `main.py` script. First we load the library:

from deepsense import neptune

Then we initialize a Neptune context:

ctx = neptune.Context()

Next, the command

ctx.integrate_with_tensorflow()

automatically creates and manages Neptune channels related to TensorFlow SummaryWriter objects. Thereby, we can observe the progress of our network in the Neptune Dashboard. Finally, in lines

tags = ["tensorflow", "tutorial"]
evaluation.evaluate_and_send_to_neptune(submission, true_labels, ctx, tags)

we evaluate our submission and send metric values to dedicated Neptune channels.

`tags` is a list of keywords that we attach to the Neptune job. We can then easily filter jobs by tags in the Neptune Web UI.

#### 4. Customizing a view in Neptune’s Web UI

If the job has been successfully executed, we can see our submission in the Neptune group we created. One more thing worth doing is setting up the view of columns.

1. Click “Show/hide columns” in the upper part of the Neptune Web UI.
2. Check/uncheck the names. You should:
• uncheck “Project” since all the submissions in this group come from the same project CIFAR-10,
• check channel names “accuracy” and “cross entropy” because you want to sort with respect to them.

You can sort submissions by accuracy or cross entropy value by clicking the triangle over the respective column.

## Summary

That’s all! Now your internal validation leaderboard in Neptune is all set up. You and your team members can compare your models tuned on the CIFAR-10 dataset. You can also filter your results by dates, users or custom tags.

Of course, CIFAR-10 is not the only possible application of the provided code. You can easily adapt it for other applications such as contests, research or business intelligence. Feel free to use an internal validation leaderboard in Neptune wherever and whenever you need.

The post An internal validation leaderboard in Neptune appeared first on deepsense.io.

### Data Science of Sales Calls: 3 Actionable Findings

How does AI help sales and marketing teams in an organisation? Let's understand the dos and don'ts of sales calls with the help of an analysis of more than 70,000 B2B SaaS sales calls.

### Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1


We are well into the Big Data era, with organizations collecting massive amounts of data on a continual basis. Yet, the value of this data deluge hinges on the ability to extract actionable insights in a timely fashion. Hence, there is an increasing need for continuous applications that can derive real-time actionable insights from massive data ingestion pipelines.

However, building production-grade continuous applications can be challenging, as developers need to overcome many obstacles, including:

• Providing end-to-end reliability and correctness guarantees – Long running data processing systems must be resilient to failures by ensuring that outputs are consistent with results processed in batch. Additionally, unusual activities (e.g., failures in upstream components, traffic spikes, etc.) must be continuously monitored and automatically mitigated to ensure highly available insights are delivered in real-time.
• Performing complex transformations – Data arrives in myriad formats (CSV, JSON, Avro, etc.) that often must be restructured, transformed and augmented before being consumed. Such restructuring requires that all the traditional tools from batch processing systems are available, but without the added latencies that they typically entail.
• Handling late or out-of-order data – When dealing with the physical world, data arriving late or out-of-order is a fact of life. As a result, aggregations and other complex computations must be continuously (and accurately) revised as new information arrives.
• Integrating with other systems – Information originates from a variety of sources (Kafka, HDFS, S3, etc.), which must be integrated to see the complete picture.

Structured Streaming in Apache Spark builds upon the strong foundation of Spark SQL, leveraging its powerful APIs to provide a seamless query interface, while simultaneously optimizing its execution engine to enable low-latency, continually updated answers. This blog post kicks off a series in which we will explore how we are using the new features of Apache Spark 2.1 to overcome the above challenges and build our own production pipelines.

In this first post, we will focus on an ETL pipeline that converts raw AWS CloudTrail audit logs into a JIT data warehouse for faster ad-hoc queries. We will show how easy it is to take an existing batch ETL job and subsequently productize it as a real-time streaming pipeline using Structured Streaming in Databricks. Using this pipeline, we have converted 3.8 million JSON files containing 7.9 billion records into a Parquet table, which allows us to run ad-hoc queries on an up-to-the-minute Parquet table 10x faster than on the raw JSON files.

## The Need for Streaming ETL

Extract, Transform, and Load (ETL) pipelines prepare raw, unstructured data into a form that can be queried easily and efficiently. Specifically, they need to be able to do the following:

• Filter, transform, and clean up data – Raw data is naturally messy and needs to be cleaned up to fit into a well-defined structured format. For example, parsing timestamp strings to date/time types for faster comparisons, filtering corrupted data, nesting/unnesting/flattening complex structures to better organize important columns, etc.
• Convert to a more efficient storage format – Text, JSON and CSV data are easy to generate and are human readable, but are very expensive to query. Converting them to more efficient formats like Parquet, Avro, or ORC can reduce file size and improve processing speed.
• Partition data by important columns – By partitioning the data based on the value of one or more columns, common queries can be answered more efficiently by reading only the relevant fraction of the total dataset.
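The first and last of these steps can be illustrated without Spark. The plain-Python sketch below parses timestamp strings, drops records that fail to parse, and groups the rest by date. The function names and sample records are hypothetical, and the timestamp pattern is an assumption about the input format.

```python
from collections import defaultdict
from datetime import datetime

def parse_event_time(text):
    """Parse a timestamp string; return None if it is missing or malformed."""
    try:
        return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        return None

def clean_and_partition(records):
    """Drop records with bad timestamps and group the rest by calendar date."""
    partitions = defaultdict(list)
    for record in records:
        ts = parse_event_time(record.get("eventTime"))
        if ts is None:
            continue  # filter corrupted data
        partitions[ts.date()].append({**record, "timestamp": ts})
    return partitions

# Made-up records, for illustration only.
raw = [
    {"eventTime": "2017-01-19T23:59:58Z", "eventName": "GetObject"},
    {"eventTime": "not-a-timestamp", "eventName": "PutObject"},
    {"eventTime": "2017-01-20T00:00:03Z", "eventName": "ListBuckets"},
]
partitions = clean_and_partition(raw)

# A query for a single day only has to touch that day's partition.
print(sorted(partitions))
```

A partitioned Parquet table applies the same idea at the storage level: records sharing a partition value are stored together, so a query that filters on that column can skip everything else.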

Traditionally, ETL is performed as periodic batch jobs. For example, dump the raw data in real time, and then convert it to structured form every few hours to enable efficient queries. We had initially set up our system this way, but this technique incurred a high latency; we had to wait a few hours before getting any insights. For many use cases, this delay is unacceptable. When something suspicious is happening in an account, we need to be able to ask questions immediately. Waiting minutes to hours could result in an unreasonable delay in responding to an incident.

Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.

In the rest of this post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned Parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts by delivering gzipped JSON log files to an S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON log files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourself.

## Transforming Raw Logs with Structured Streaming

We start by defining the schema of the JSON records based on CloudTrail documentation.

val cloudTrailSchema = new StructType()
// ...


See the attached notebook for the full schema. With this, we can define a streaming DataFrame that represents the data stream from CloudTrail files that are being written to an S3 bucket.

val rawRecords = spark.readStream
.schema(cloudTrailSchema)
.json("s3n://mybucket/AWSLogs/*/CloudTrail/*/2017/*/*")


A good way to understand what this rawRecords DataFrame represents is to first understand the Structured Streaming programming model. The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.

This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data. In this case, we will transform the raw JSON data such that it’s easier to query using Spark SQL’s built-in support for manipulating complex nested schemas. Here is an abridged version of the transformation.

val cloudtrailEvents = rawRecords
.select(explode($"records") as 'record)
.select(
unix_timestamp($"record.eventTime", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") as 'timestamp,
$"record.*")


Here, we explode (split) the array of records loaded from each file into separate records. We also parse the event time string in each record into Spark’s timestamp type, and flatten out the nested columns for easier querying. Note that if cloudtrailEvents were a batch DataFrame on a fixed set of files, then we would have written the same query, and we would have written the results only once as cloudtrailEvents.write.parquet("/cloudtrail"). Instead, we will start a StreamingQuery that runs continuously to transform new data as it arrives.

val streamingETLQuery = cloudtrailEvents
.withColumn("date", $"timestamp".cast("date")) // derive the date
.writeStream
.trigger(ProcessingTime("10 seconds")) // check for files every 10s
.format("parquet") // write as Parquet partitioned by date
.partitionBy("date")
.option("path", "/cloudtrail")
.option("checkpointLocation", "/cloudtrail.checkpoint/")
.start()


Here we are specifying the following configurations for the StreamingQuery before starting it.

• Derive the date from the timestamp column
• Check for new files every 10 seconds (i.e., trigger interval)
• Write the transformed data from the cloudtrailEvents DataFrame as a Parquet-formatted table at the path /cloudtrail.
• Partition the Parquet table by date so that we can later efficiently query time slices of the data; a key requirement in monitoring applications.
• Save checkpoint information at the path /cloudtrail.checkpoint/ for fault-tolerance (explained later in the blog)
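The explode, timestamp-parsing, and date-partitioning steps can also be illustrated without Spark. The following is a minimal plain-Python sketch of the idea, not the Spark API: the sample payload, the `Records` key, and the helper names `flatten_records` and `partition_path` are assumptions made for illustration.

```python
import json
from datetime import datetime

# Hypothetical CloudTrail-style file content: one JSON object holding an array of records.
SAMPLE_FILE = json.dumps({
    "Records": [
        {"eventTime": "2017-01-20T21:22:54Z", "eventName": "RunInstances"},
        {"eventTime": "2017-01-21T03:05:10Z", "eventName": "DeleteBucket"},
    ]
})

def flatten_records(file_content):
    """Explode the array of records and parse each event time string."""
    rows = []
    for record in json.loads(file_content)["Records"]:
        timestamp = datetime.strptime(record["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
        # Keep the original fields and add the parsed timestamp and derived date.
        rows.append({**record, "timestamp": timestamp, "date": timestamp.date()})
    return rows

def partition_path(base_path, row):
    """Directory a row would land in when the table is partitioned by date."""
    return f"{base_path}/date={row['date'].isoformat()}"

rows = flatten_records(SAMPLE_FILE)
paths = [partition_path("/cloudtrail", row) for row in rows]
```

With this layout, a query restricted to a single day only has to read the files under one `date=...` directory.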

In terms of the Structured Streaming Model, this is how the execution of this query is performed.

Conceptually, the rawRecords DataFrame is an append-only Input Table, and the cloudtrailEvents DataFrame is the transformed Result Table. In other words, when new rows are appended to the input (rawRecords), the result table (cloudtrailEvents) will have new transformed rows. In this particular case, every 10 seconds, the Spark SQL engine triggers a check for new files. When it finds new data (i.e., new rows in the Input Table), it transforms the data to generate new rows in the Result Table, which then get written out as Parquet files.
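To make the Input Table and Result Table idea concrete, here is a toy Python model of it. It is only a sketch of the concept: the class `UnboundedTable` and its methods are hypothetical and say nothing about how Spark implements triggers internally.

```python
class UnboundedTable:
    """Toy model of an append-only input table and its incrementally built result table."""

    def __init__(self, transform):
        self.transform = transform
        self.input_rows = []    # Input Table: only ever appended to
        self.result_rows = []   # Result Table: transformed rows
        self._processed = 0     # number of input rows already transformed

    def append(self, rows):
        self.input_rows.extend(rows)

    def trigger(self):
        """Transform only the rows that arrived since the previous trigger."""
        new_rows = self.input_rows[self._processed:]
        new_results = [self.transform(row) for row in new_rows]
        self.result_rows.extend(new_results)
        self._processed = len(self.input_rows)
        return new_results

table = UnboundedTable(transform=str.upper)
table.append(["runinstances", "deletebucket"])
first = table.trigger()     # processes the two rows seen so far
table.append(["createuser"])
second = table.trigger()    # processes only the newly appended row
```

Each trigger touches only the rows that arrived since the last one, yet the Result Table ends up identical to what a single batch run over all rows would produce.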

Furthermore, while this streaming query is running, you can use Spark SQL to simultaneously query the Parquet table. The streaming query writes the Parquet data transactionally such that concurrent interactive query processing will always see a consistent view of the latest data. This strong guarantee is known as prefix-integrity and it makes Structured Streaming pipelines integrate nicely with the larger Continuous Application.

You can read more details about the Structured Streaming model, and its advantages over other streaming engines in our previous blog.

## Solving Production Challenges

Earlier, we highlighted a number of challenges that must be solved for running a streaming ETL pipeline in production. Let’s see how Structured Streaming running on the Databricks platform solves them.

### Recovering from Failures to get Exactly-once Fault-tolerance Guarantees

Long running pipelines must be able to tolerate machine failures. With Structured Streaming, achieving fault-tolerance is as easy as specifying a checkpoint location for the query. In the earlier code snippet, we did so in the following line.

.option("checkpointLocation", "/cloudtrail.checkpoint/")


This checkpoint directory is per query, and while a query is active, Spark continuously writes metadata of the processed data to the checkpoint directory. Even if the entire cluster fails, the query can be restarted on a new cluster, using the same checkpoint directory, and consistently recover. More specifically, on the new cluster, Spark uses the metadata to start the new query where the failed one left off, thus ensuring end-to-end exactly-once guarantees and data consistency (see Fault Recovery section of our previous blog).
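The recovery idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not Spark's checkpoint format or commit protocol: the function `run_once`, the JSON checkpoint file, and the one-output-per-input layout are all assumptions chosen to keep the example small.

```python
import json
import os
import tempfile

def run_once(input_files, output_dir, checkpoint_path):
    """Process files not yet recorded in the checkpoint; return the names processed."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))

    processed = []
    for name, content in sorted(input_files.items()):
        if name in done:
            continue  # already handled before the restart
        # One output file per input file, so re-writing after a crash is idempotent.
        with open(os.path.join(output_dir, name + ".out"), "w") as out:
            out.write(content.upper())
        done.add(name)
        # Record progress only after the output has been written.
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)
        processed.append(name)
    return processed

workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "checkpoint.json")
files = {"a.json": "x", "b.json": "y"}

first_run = run_once(files, workdir, checkpoint)
files["c.json"] = "z"                                  # new data arrives
after_restart = run_once(files, workdir, checkpoint)   # simulates a restarted query
```

Because progress is recorded only after the output exists, and rewriting an output file is harmless, a crash between the two steps leads to a repeat of work rather than lost or duplicated results.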

Furthermore, this same mechanism allows you to upgrade your query between restarts, as long as the input sources and output schema remain the same. Since Spark 2.1, we encode the checkpoint data in JSON for future-proof compatibility. So you can restart your query even after updating your Spark version. In all cases, you will get the same fault-tolerance and consistency guarantees.

Note that Databricks makes it very easy to set up automatic recovery, as we will show in the next section.

For a Continuous Application to run smoothly, it must be robust to individual machine or even whole cluster failures. In Databricks, we have developed tight integration with Structured Streaming that allows us to continuously monitor your StreamingQueries for failures and automatically restart them. All you have to do is create a new Job, and configure the Job retry policy. You can also configure the job to send emails to notify you of failures.

Application upgrades can be easily made by updating your code and/or Spark version and then restarting the Job. See our guide on running Structured Streaming in Production for more details.

Machine failures are not the only situations that we need to handle to ensure robust processing. We will discuss how to monitor for traffic spikes and upstream failures in more detail later in this series.

### Combining Live Data with Historical/Batch Data

Many applications require historical/batch data to be combined with live data. For example, besides the incoming audit logs, we may already have a large backlog of logs waiting to be converted. Ideally, we would like to achieve both: interactively query the latest data as soon as possible, and also have access to historical data for future analysis. It is often complex to set up such a pipeline using most existing systems, as you would have to set up multiple processes: a batch job to convert the historical data, a streaming pipeline to convert the live data, and maybe another step to combine the results.

Structured Streaming eliminates this challenge. You can configure the above query to prioritize the processing of new data files as they arrive, while using the spare cluster capacity to process the old files. First, we set the option latestFirst for the file source to true, so that new files are processed first. Then, we set maxFilesPerTrigger to limit how many files to process in each trigger. This tunes the query to update the downstream data warehouse more frequently, so that the latest data is made available for querying as soon as possible. Together, we can define the rawJson DataFrame as follows:

val rawJson = spark.readStream
.schema(cloudTrailSchema)
.option("latestFirst", "true")
.option("maxFilesPerTrigger", "20")
.json("s3n://mybucket/AWSLogs/*/CloudTrail/*/2017/01/*")


In this way, we can write a single query that easily combines live data with historical data, while ensuring low-latency, efficiency and data consistency.
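The effect of the two options on processing order can be sketched in plain Python. This is an illustration of the scheduling idea only: `plan_triggers` is a hypothetical helper, the file names and modification times are made up, and the static plan ignores files that arrive while the backlog is being processed.

```python
def plan_triggers(files, max_files_per_trigger, latest_first=True):
    """Split (name, modified_time) pairs into per-trigger batches."""
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    names = [name for name, _ in ordered]
    return [names[i:i + max_files_per_trigger]
            for i in range(0, len(names), max_files_per_trigger)]

# Hypothetical backlog: three old files and two recent ones.
backlog = [("old-1", 100), ("old-2", 200), ("old-3", 300), ("new-1", 900), ("new-2", 950)]
batches = plan_triggers(backlog, max_files_per_trigger=2)
```

The newest files are handled in the first batch, and the older backlog is worked off in later, equally small batches, which keeps each trigger short.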

## Conclusion

Structured Streaming in Apache Spark is the best framework for writing your streaming ETL pipelines, and Databricks makes it easy to run them in production at scale, as we demonstrated above. We shared a high level overview of the steps—extracting, transforming, loading and finally querying—to set up your streaming ETL production pipeline. We also discussed and demonstrated how Structured Streaming overcomes the challenges in solving and setting up high-volume and low-latency streaming pipelines in production.

In future blog posts in this series, we’ll cover how we address other hurdles, including:

• Applying complex transformations to nested JSON data
• Integrating Structured Streaming with Apache Kafka
• Computing event time aggregations with Structured Streaming

## What’s Next

You can try two notebooks with your own AWS CloudTrail Logs. Import the notebooks into Databricks.

--

The post Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1 appeared first on Databricks.

### Going to War with the Giants: Automated Machine Learning with MLJAR

The performance of automated machine learning tool MLJAR on Kaggle competition data is presented in comparison with those from other predictive APIs from Amazon, Google, PredicSis and BigML.

### The big data ecosystem for science: X-ray crystallography

Diffract-and-destroy experiments to accurately determine three-dimensional structures of nano-scale systems can produce 150 TB of data per sample. We review how such Big Data is processed.

### The oceans’ UFOs pose new risks for professional sailors

GETTING to the finish line of Vendée Globe has always been hard. The planet’s only non-stop solo round-the-world sailing race—known as the “Everest of the seas”—can take months to complete, and is considered a gruelling test of mental and physical endurance.

### Homework #2

What's the next step from this? As I already mentioned, there is another way to express the variational bound in terms of a KL divergence, and if one uses logistic regression to estimate relevant log density ratios, one gets an alternative GAN-type algorithm. I already gave you a clue in the intro saying that the resulting algorithm will look a lot like ALI or BiGAN. Off you go.

Finally, thanks to Ben Poole for some comments on this draft, and for pointing out the new paper by Mescheder and colleagues.

### Building & Maintaining a Master Data Dictionary: Part Two

In the first part of this series, we explored what a master data dictionary (MDD) is, its need, the important aspects to consider when building one, and the stakeholders involved.

In this second and final part, we’ll delve into a few suggestions for the structure of the dictionary and discuss how to choose the right tool to build it in.

### Structure of the master data dictionary

Broadly, an MDD can be organized in two ways:

1. By data source: This is the method we use at Magento Business Intelligence. Each metric is categorized under the primary data source it is coming from. For example, all support metrics are under Zendesk (our support tool), all account level metrics are under Salesforce, and all time-tracking metrics are under Toggl.
2. By business function: This method should be used when metrics are often joined across multiple data sources. For example, if this method was used at Magento Business Intelligence, the categories would be “Support Metrics”, “Account Level Metrics” and “Operational Efficiency”.

Under either method, the metrics can be simply listed with their respective SQL queries or organized as a “View”. At Magento Business Intelligence, we use a combination of both:

1. Simple, listed-out metrics: This is the simplest method to organize metrics, but can involve some work for a user when combining several metrics to create a report. For example, let’s say we’re trying to analyze support metrics, and we want to build a report that calculates the number of new support requests along with their assignment and resolution times on a weekly basis. To do so using this method, we would need to find the individual queries for “new support requests”, “assignment time” and “resolution time” and combine them as a single query when creating the report.
2. Creating views: This is a method that is very useful when using a SQL based tool such as Mode Analytics. At Magento Business Intelligence, we use Mode for a large part of our internal reporting. It has a feature called “Definitions” that allows the creation of run-time “views”, which are pre-defined SQL queries that can be referenced like tables in SQL queries. By creating views for each data source, the unnecessary underlying logic can be abstracted from the end-user. For example, the view for “Support Metrics” can contain the columns “Assignment Time”, “Resolution Time”, “Number of responses”, “Time Between Responses” and others. When using this view, the end-user will only need to SELECT the required columns from the view instead of using the underlying SQL logic to calculate these columns. This accomplishes the primary objective of master data dictionaries – maintaining consistency in definitions across organizations.
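The view approach can be tried out with any SQL engine. Below is a small sketch using Python's built-in `sqlite3` module rather than Mode's “Definitions” feature; the `tickets` table, its columns, and the `support_metrics` view are hypothetical names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (
        id INTEGER PRIMARY KEY,
        created_hour INTEGER,
        assigned_hour INTEGER,
        resolved_hour INTEGER
    );
    INSERT INTO tickets VALUES (1, 0, 2, 10), (2, 5, 6, 9);

    -- The view hides how each metric is calculated.
    CREATE VIEW support_metrics AS
    SELECT id AS ticket_id,
           assigned_hour - created_hour AS assignment_time,
           resolved_hour - created_hour AS resolution_time
    FROM tickets;
""")

# The end-user only selects the columns they need from the view.
report = conn.execute(
    "SELECT ticket_id, assignment_time, resolution_time "
    "FROM support_metrics ORDER BY ticket_id"
).fetchall()
```

If the definition of “resolution time” ever changes, only the view needs to be updated, and every report built on it picks up the new definition.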

### Choosing the right tool

The choice of tool depends on the structure being maintained in the master data dictionary. At Magento Business Intelligence, we use a combination of our own tool, Magento Business Intelligence (formerly RJMetrics), Wiki, and Mode Analytics to maintain our master data dictionary. Below are the benefits of using each of these tools:

1. Magento Business Intelligence (RJMetrics): We recommend using this tool when the end user is non-technical and not familiar with SQL. Indeed, this is the primary use case of Magento Business Intelligence. It allows the creation of “metrics”, which are pre-defined queries that can be “dragged-and-dropped” to reports. The user need not know the underlying logic of the metric.
2. Wiki: When using the “simple listed out metrics” method of structuring the master data dictionary, a tool like a wiki article can be quite easy and accessible. The “contents” section of the wiki can be linked to the metrics listed in the document. Each metric can contain the required definition, explained in either simple English, SQL queries or a combination of both. An end-user accessing the wiki will need to click on the desired hyperlink at the top of the document to navigate to the required metric’s definition.
3. Mode Analytics: Mode makes it very easy to store SQL “views” using the “Definitions” feature and share reports across organizations. While it is not the ideal tool for non-technical business users to build reports in, it can be very efficient for technical users to collaborate and build reports in. It also has the capability to embed reports outside the platform, which opens up new use cases as well. At Magento Business Intelligence, we use Mode Analytics in conjunction with Stitch, the data consolidation tool. Stitch pipes our data from the different data sources to an Amazon Redshift cluster, from which Mode Analytics can read it.

Finally, the choices you make regarding the structure and tool for your master data dictionary depend on several other variables. It is possible that none of the above options are optimal for your use case. In this case, we would love to hear about how your situation is different and what you have done to build out a master data dictionary! Please feel free to reach out to us in the comments section below.

The post Building & Maintaining a Master Data Dictionary: Part Two appeared first on The Data Point.

### Embedding.js: Data-driven environments for virtual reality

Embedding.js is a work-in-progress JavaScript library by Beau Cronin that makes it more straightforward to create data-driven environments. Think virtual reality and rotating areas in the browser.

[I]t’s not just about 3D — we’ve used various depth cues in windowed visualization settings for some time, and in some cases these techniques have been put to good use. But something altogether different happens when we inhabit an environment, and in particular when our sensory inputs change immediately and predictably in response to our movements. Real-world perception is not static, but active and embodied; the core hypothesis behind embedding is that data-driven environments can deliver greater understanding to the degree to which they leverage the mechanisms of exploration and perception that we use, effortlessly, in going about our daily lives.

A case where virtual immersion led to greater understanding doesn’t come to mind right away, but maybe ease-of-use is a step towards getting there.


### Good fences (between data science and production) make good neighbors

What data scientists need to know about production—and what production should expect from their data scientists.

One of the most important goals of any data science team is the ability to create machine learning models, evaluate them offline, and get them safely to production. The faster this process can be performed, the more effective most teams will be. In most organizations, the team responsible for scoring a model and the team responsible for training a model are separate. Because of this, a clear separation of concerns is necessary for these two teams to operate at whatever speed suits them best. This post will cover how to make this work: implementing your ML algorithms in such a way that they can be tested, improved, and updated without causing problems downstream or requiring changes upstream in the data pipeline.

We can get clarity about the requirements for the data and production teams by breaking the data-driven application down into its constituent parts. In building and deploying a real-time data application, the goal of the data science team is to produce a function that reliably and in real-time ingests each data point and returns a prediction. For instance, if the business concern is modeling churn, we might ingest the data about a user and return a predicted probability of churn. The fact that we have to featurize that user and then send them through a random forest, for instance, is not the concern of the scoring team and should not be exposed to them.

The above illustration shows a perfect world for the scoring team. They have some data type A that they can analyze—the set of features that your software has observed. A can be a JSON message describing a user, or A can be a Protocol Buffer describing a transaction, or an Avro message describing an item. They then have a model that performs some task—churn prediction, chargeback probability, etc. They then get a result that they use to continue processing.

In order to achieve this goal, the data science team has to have tooling that does at least two things at score time:

1. Create a model that can ingest and return the expected native data type
2. Be able to supply an external representation of a model

In order to address the first issue, we have to realize that the act of featurization must be embedded in the model. Not only does that make the scoring team’s job easier, it also removes a potential source of error, namely feature functions that are different at train and score time. For instance, if the data science team works in R and takes data from a database to make a model, but the scoring team works in Java, then R feature functions will have to be reimplemented at score time in Java. The below training diagram shows what a supervised training architecture might look like using generic types.

While this looks complicated, it should seem pretty familiar to most data scientists after breaking it down. First, we start with training data of some type we’ll call A. This is historical data, such as users at specific times. As a side note, it is incredibly important to make sure that the historical data is time-bounded to avoid information leak. We then take those training examples and featurize them through feature functions we have constructed. These functions all convert a single data type A into a single data type T. Type T is frequently either a scalar value forming a CSV file or a sequence of values for input formats such as LibSVM or Vowpal Wabbit. In both cases, a single List[T] is natively understandable by a machine learning library as one row of featurized data. We will then have a separate list List[G] of observations of ground truth G that we have to join to the featurized training data List[List[T]]. The result is input data that is legible to our machine learning library, whose format per row is (G, List[T]).

Once we have a List[(G, List[T])], we can use any supervised learning framework to train a model. The output of this is a machine learning model, a function defined as List[T] => O. O is the native output type of the model used, which may or may not be the desired output type of the whole model—the output defined by your business needs. A good example of this would be a segmentation model where we desire to classify users as either highly likely to churn, somewhat likely to churn, or unlikely to churn. The output type desired by the calling code is an enum, but the model itself may output a float. We will then use a finalizer to convert that float into an enum through simple segmentation. In the case where the output type of the native model is the same as the type desired, the identity function may be used as the finalizer.

One important caveat to note is that the machine learning model List[T] => O is in the native format of the library used to learn it, such as a Vowpal Wabbit model or H2O. These dependencies are now needed by the whole model. Traditionally, this is a pain point for most machine learning systems in production, as it locks in a specific learning framework that is then difficult to change in the future. In our formulation, however, the types needed by the framework-specific model are not exposed to the calling code. Because of this, the framework-specific model can be swapped out at any time without changing the type contract guaranteed at score time. This is accomplished through the use of function composition. This is a huge win for both the data science team as well as the scoring team. It allows the data science team to be flexible and use the library that best solves each individual problem. It also allows easy version upgrades of specific libraries without fear of breaking models. It makes life easier for the scoring team, too, as they don’t have to fret over understanding any of the machine learning frameworks and can instead focus on scale and reliability.
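The composition described here can be sketched in a few lines of Python. This is not Aloha's API or any particular framework's model format: `compose_model`, the feature functions, the stand-in scoring formula, and the segmentation thresholds are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, List

class ChurnRisk(Enum):
    LOW = "unlikely"
    MEDIUM = "somewhat likely"
    HIGH = "highly likely"

def compose_model(featurize, native_model, finalizer):
    """Build A => B from A => List[T], List[T] => O, and O => B."""
    return lambda example: finalizer(native_model(featurize(example)))

# A => List[T]: feature functions applied to the raw input type (here, a dict).
def featurize(user: Dict) -> List[float]:
    return [float(user["days_inactive"]), float(user["support_tickets"])]

# List[T] => O: stand-in for a framework-specific model that outputs a float.
def native_model(features: List[float]) -> float:
    score = 0.02 * features[0] + 0.1 * features[1]
    return min(score, 1.0)

# O => B: the finalizer segments the float into the enum the caller expects.
def finalizer(score: float) -> ChurnRisk:
    if score >= 0.7:
        return ChurnRisk.HIGH
    if score >= 0.3:
        return ChurnRisk.MEDIUM
    return ChurnRisk.LOW

predict_churn = compose_model(featurize, native_model, finalizer)
result = predict_churn({"days_inactive": 30, "support_tickets": 2})
```

The calling code only ever sees a function from a user record to a `ChurnRisk`, so `native_model` could be replaced by a model from a different library without changing that contract.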

This work is so important to me that I led a team that has open-sourced these ideas as a project called Aloha. Aloha is an implementation of many of these ideas within a Scala DSL. Aloha is supported by a community of production data scientists, led by Ryan Deak, the main author of the project. This project has received commercial support from both eHarmony and ZEFR, and is currently under active development. Aloha has streamlined the model deployment process and reduced production error rates in multiple deployment environments to date.

A nice benefit to controlling the featurization layer is that we can place a QA engine within the model itself. In the future, we plan to be able to add arbitrary QA tests to model inputs and take an action (such as an email) if such conditions are not satisfied. For instance, if we observe a feature is present in 90% of examples at train time and then through the use of a sliding window see that it is only present in 10% of examples at score time, then the model may perform very poorly through no fault of its own, but rather because of a data preparation issue. This information can be encoded in an Aloha model, and an action can be associated with it to trigger notification if the data drifts at score time.
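A check of this kind might look like the following Python sketch. It does not use Aloha; the class `PresenceMonitor`, the window size, and the tolerance are hypothetical choices, and a real deployment would attach a notification action where this example simply returns `True`.

```python
from collections import deque

class PresenceMonitor:
    """Tracks how often a feature is present over a sliding window of scored examples."""

    def __init__(self, feature, train_rate, window_size, tolerance):
        self.feature = feature
        self.train_rate = train_rate
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)

    def observe(self, example):
        """Record one example; return True if the window has drifted from training."""
        self.window.append(example.get(self.feature) is not None)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        score_rate = sum(self.window) / len(self.window)
        return abs(score_rate - self.train_rate) > self.tolerance

# The feature was present in 90% of training examples.
monitor = PresenceMonitor("age", train_rate=0.9, window_size=10, tolerance=0.2)
examples = [{"age": 30}] + [{"country": "US"}] * 9   # present in only 10% at score time
alerts = [monitor.observe(example) for example in examples]
```

The monitor stays silent until the window is full, then flags the gap between the 90% presence seen at train time and the 10% seen at score time.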

Having a quick and safe path to production should be a top priority for all engineering teams, and data science is no exception. While I have seen many approaches to productionalizing data science, any of them that don’t put machine learned models directly from the data scientist's code to production fall short of realizing their full potential.

### Putting the science back in data science

Best practices and scalable workflows for reproducible data science.

One of the key tenets of science (physics, chemistry, etc.), or at least the theoretical ideal of science, is reproducibility. Truly “scientific” results should not be accepted by the community unless they can be clearly reproduced and have undergone a peer review process. Of course, things get messy in practice for both academic scientists and data scientists, and many workflows employed by data scientists are far from reproducible. These workflows may take the form of:

• A series of Jupyter notebooks with increasingly descriptive names, such as second_attempt_at_feature_selection_for_part2.ipynb
• Python or R scripts manually copied to a machine and run at periodic times via cron
• Fairly robust, but poorly understood, applications built by engineers based on specifications handed off to the engineers from data scientists
• Applications producing results that are nearly impossible to tie to specific states of one or more continuously changing input data sets

At the very best, the results generated by these sorts of workflows could be re-created by the person(s) directly involved with the project, but they are unlikely to be reproduced by anyone new to the project or by anyone charged with reviewing the project.

Reproducibility-related data science woes are being expressed throughout the community:

Data analysis is incredibly easy to get wrong, and it's just as hard to know when you're getting it right, which makes reproducible research all the more important!—Reproducibility is not just for researchers, Data School

Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done.—Cookiecutter Data Science

Six months later, someone asks you a question you didn't cover so you need to reproduce your analysis. But you can't remember where the hell you saved the damn thing on your computer. If you're a data scientist (especially the decision sciences/analysis focused kind), this has happened to you.—The Most Boring/Valuable Data Science Advice, by Justin Bozonier

The problem of reproducibility is one that data science teams within an organization will have to tackle at some point. However, there is good news! With a little bit of discipline and the right tooling, data science teams can achieve reproducibility. This post will discuss the value of reproducibility and will provide some practical steps toward achieving it.

One could argue that as long as your models and analyses produce “good” results, it doesn’t matter whether those results could be re-created. However, even small teams of data scientists will hit a wall if they neglect reproducibility. Reproducibility in data science shouldn’t be forsaken, regardless of the size of your organization or the maturity of your company, because reproducibility is a precursor to:

Collaboration: Data science, and science in general for that matter, is a collaborative endeavor. No data scientist knows all relevant modeling techniques and analyses, and, even if they did, the size and complexity of the data-related problems in modern companies are almost always beyond the control of a single person. Thus, as a data scientist, you should always be concerned about how you share your results with your colleagues and how you collaborate on analyses/models. Specifically, you should share your work and deploy your products in a way that allows others to do exactly what you did, with the same data you used, to produce the same result. Otherwise, your team will not be able to capitalize on its collective knowledge, and advances within the team will be made and understood only by individuals.

Creativity: How do you know if a new model is performing better than an old model? How can you properly justify adding creative sophistication or complexity to analyses? Unfortunately, these questions are often addressed via one individual’s trial and error (e.g., in a notebook), which is lost forever after the decisions are made. If analyses are reproducible, however, data science teams can: (1) concretely determine how new analyses compare to old analyses because the old analyses can be exactly reproduced and the new analyses can be run against the known previous data; and (2) clearly see which analyses performed poorly in the past to avoid repeating mistakes.

Compliance: As more and more statistical, machine learning, and artificial intelligence applications make decisions that directly impact users, there will be more and more public pressure to explain and reproduce results. In fact, the EU is already demanding a “right to an explanation” for many algorithmically generated, user-impacting decisions. How could such an explanation be given or an audit trail be established without a clearly understood and reproducible workflow that led to the results?

## How can a data science team achieve reproducibility?

Successfully enabling reproducibility will look slightly different for every organization because data science teams are tasked with such a wide variety of projects. However, implementing some combination of the following best practices, techniques, and tooling is likely to help move your workflows closer to reproducibility.

### Strive for and celebrate simple, interpretable solutions

Deep learning is a prime example of powerful, yet often difficult to reproduce, analytical tools. Not all business problems require it, even though deep learning and other types of neural networks are clearly very powerful. Often a simple statistical aggregation (e.g., calculating a min or max) does wonders with respect to data-driven decision-making. In other cases, a linear regression or a decision tree might produce adequate, or even very good, predictions.

In these cases, the price paid in interpretability with more complicated modeling techniques might not be worth gains in precision or accuracy. The bottom line is that it is harder to ensure reproducibility for complicated data pipelines and modeling techniques, and reproducibility should be valued above using the latest and greatest models.

### No reproducibility, no deployment

No matter what time crunch you are facing, it’s not worth putting a flaky implementation of an analysis into production. As data scientists, we are working to create a culture of data-driven decision-making. If your application breaks without an explanation (likely because you are unable to reproduce the results), people will lose confidence in your application and stop making decisions based on the results of your application. Even if you eventually fix it, that confidence is very, very hard to win back.

Data science teams should require reproducibility in the same way they require unit testing, linting, code versioning, and review. Without consistently producing results as good or better than known results for known data, analyses should never be passed on to deployment. This performance can be measured via techniques similar to integration testing. Further, if possible, models can be run in parallel on current data running through your systems for a side-by-side comparison with current production models.
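One way to enforce such a gate is a test that scores every candidate on a fixed, known data set and compares it with a recorded baseline. The sketch below is a minimal illustration: the data set, the baseline score, and the function names are hypothetical, and accuracy stands in for whatever metric your team actually tracks.

```python
def accuracy(model, examples):
    """Fraction of (input, label) pairs the model predicts correctly."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)

def ready_for_deployment(candidate, baseline_score, examples):
    """Gate: the candidate must match or beat the known result on the known data."""
    return accuracy(candidate, examples) >= baseline_score

# Fixed evaluation set: (value, is_large) pairs that never change between runs.
KNOWN_DATA = [(1, False), (4, False), (6, True), (9, True)]
BASELINE_SCORE = 0.75  # assumed score of the current production model on KNOWN_DATA

def good_candidate(x):
    return x > 5

def bad_candidate(x):
    return x > 9
```

Here `ready_for_deployment(good_candidate, BASELINE_SCORE, KNOWN_DATA)` passes while the same call with `bad_candidate` fails, and because the data never changes, anyone on the team can rerun the comparison and get the same answer.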

You can orchestrate this sort of testing and measurement on your own, but you might consider taking advantage of something like LeVar. LeVar provides “a database for storing evaluation data sets and experiments run against them, so that over time you can keep track of how your methods are doing against static, known data inputs.”

Even if you have code or Jupyter notebooks versioned, you simply can’t reproduce an analysis if you don’t run the code or notebook on the same data. This means that you need to have a plan and tooling in place to retrieve the state of both your analysis and your data at certain points in history. As time goes on, there are more and more options to enable data versioning, and they will be discussed below, but your team should settle on a plan for data versioning and stick to it. Data science prior to data versioning is a little bit like software engineering before Git.

Pachyderm is a tool I know well (disclosure: I work at Pachyderm) that allows you to commit data with versioning similar to committing code via Git. The Pachyderm file system is made up of “data repositories” into which you can commit data via files of any format.

You can encapsulate any manipulation of your data in a commit to Pachyderm. That means the operation is reversible and the new state is reproducible for you and your colleagues. Just as in Git, commits are immutable, so you can always refer back to a previous state of your data.
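To make the idea of immutable data commits concrete, here is a toy in-memory sketch. It is not Pachyderm's API or implementation; the `DataRepo` class and its methods are hypothetical and only illustrate the general pattern of hash-identified commits that each point back to a parent.

```python
import hashlib


class DataRepo:
    """Toy data repository: each commit stores a snapshot of the data and a
    link to its parent, and is identified by a hash of both."""

    def __init__(self):
        self._commits = {}  # commit id -> (parent id, data bytes)
        self.head = None    # id of the most recent commit

    def commit(self, data: bytes) -> str:
        parent = self.head or ""
        commit_id = hashlib.sha256(parent.encode("utf-8") + data).hexdigest()
        self._commits[commit_id] = (self.head, data)
        self.head = commit_id
        return commit_id

    def read(self, commit_id: str) -> bytes:
        """Return the data exactly as it was at the given commit."""
        return self._commits[commit_id][1]


repo = DataRepo()
first = repo.commit(b"id,price\n1,10\n")
second = repo.commit(b"id,price\n1,10\n2,12\n")

# The earlier state is still retrievable after later changes.
print(repo.read(first) == b"id,price\n1,10\n")
```

Because the commit id depends on the parent and the contents, two colleagues who hold the same id can be confident they are analyzing the same data.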

Actually, it’s not always enough to version your data. Your data comes with its own baggage. It was generated from a series of transformations and, thus, you likely need some understanding of the “provenance” of your data. Results without context are meaningless. At every step of your analysis, you need to understand where the data came from and how it reached its current state.

Tools like Pachyderm can help us out here as well, either as a tool or as a model for your own processes. Analyses that are run via Pachyderm, for example, automatically record provenance as they execute, and it’s impossible for an analysis to take inputs without those inputs becoming provenance for the output.
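If you want to model this in your own processes, one possible sketch (not how Pachyderm does it; `Tracked`, `fingerprint`, and `run_step` are hypothetical names) is to make every step go through a wrapper that cannot produce an output without recording its inputs.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(value) -> str:
    """Stable hash of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Tracked:
    """A value together with the ordered records of how it was produced."""
    value: object
    provenance: tuple = ()


def run_step(name, func, *inputs: Tracked) -> Tracked:
    """Apply func to the input values and append a provenance record."""
    result = func(*(item.value for item in inputs))
    record = {
        "step": name,
        "inputs": [fingerprint(item.value) for item in inputs],
    }
    history = tuple(r for item in inputs for r in item.provenance)
    return Tracked(result, history + (record,))


raw = Tracked([3, 1, 2])
ordered = run_step("sort", sorted, raw)
total = run_step("sum", sum, ordered)
print([r["step"] for r in total.provenance])
```

Since the only way to get a `Tracked` result is through `run_step`, the history of each output travels with it and cannot be silently skipped.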

### Write it down

Call it documentation if you want. In any event, your documentation should have a “lab notebook” spin on it that tells the story about how you came to the decisions that shaped your analysis. Every decision should have a documented motivation with an understanding of the costs associated with those decisions.

For example, you very well might need to normalize a variable in your analysis. Yet, when you normalize that variable, the numbers associated with that variable will lose their units and might not be as readable to others. Moreover, others building off of your analysis might assume certain units based on column names, etc.
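One way to keep that decision from getting lost is to store a note alongside the normalized values. The sketch below assumes min-max scaling and a made-up temperature column; the function names and the note's fields are hypothetical, not a standard format.

```python
def normalize(values, units):
    """Min-max scale values to [0, 1] and return a note documenting the
    original units and the parameters needed to undo the scaling."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot min-max normalize a constant column")
    scaled = [(v - lo) / (hi - lo) for v in values]
    note = {"original_units": units, "method": "min-max", "min": lo, "max": hi}
    return scaled, note


def denormalize(scaled, note):
    """Recover values in their original units from the documented parameters."""
    span = note["max"] - note["min"]
    return [s * span + note["min"] for s in scaled]


temperatures = [10.0, 20.0, 30.0]  # hypothetical readings in degrees Celsius
scaled, note = normalize(temperatures, units="degrees Celsius")
print(scaled)
print(note)
```

The note is a small piece of the lab notebook: anyone building on the normalized column can see what units it started in and how to get back to them.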

Elias Ponvert explains this very well in his post, How We Do Data Science at People Pattern:

Lab notebooks are so great. Without them, it’s genuinely really hard for a data scientist to pick up where he or she left off in an experimental project, even if it’s only been a day or two since she/he was working on it.

## Conclusions

Ensuring reproducibility in your data science workflows can seem like a daunting task. However, following a few best practices and utilizing appropriate tooling can get you there. The effort will be well worth it in the end and will pay off with an environment of collaboration, creativity, and compliance.

Continue reading Putting the science back in data science.

### Randy Hunt on design at Etsy

The O’Reilly Design Podcast: Collaborating with engineering, hiring for humility, and the code debate.

In this week’s Design Podcast, I sit down with Randy Hunt, VP of design at Etsy. We talk about the culture at Etsy, why it’s important to understand the materials you are designing with, and why humility is your most important skill.

Continue reading Randy Hunt on design at Etsy.

### Four short links: 19 January 2017

Attention and Learning, Reproducing Cancer Research, Deep Traffic, and Spreadsheets to Viz

1. Attention and Reinforcement in Learning -- summary of a recent article available on Sci-Hub (no preprints available on the authors' sites). The results also showed that selective attention shapes what we learn when something unexpected happens. For example, if your pizza is better or worse than expected, you attribute the learning to whatever your attention was focused on and not to features you decided to ignore. Finally, the researchers found that what we learn through this process teaches us what to pay attention to, creating a feedback cycle — we learn about what we attend to, and we attend to what we learned high values for. See also Reprioritizing Attention in Fast Data.
2. First Batch of Reproducibility Project Results Are In -- Elizabeth Iorns and others at the Reproducibility Project tried to reproduce landmark findings in cancer research, with only some success thus far. [P]erhaps the most important result from the project so far, as Daniel Engber wrote in Slate, is that it has been “a hopeless slog.” “If people had deposited raw data and full protocols at the time of publication, we wouldn’t have to go back to the original authors,” says Iorns. That would make it much easier for scientists to truly check each other’s work. The National Institutes of Health seem to agree. In recently released guidelines, meant to improve the reproducibility of research, they recommend that journals ask for more thorough methods sections and more sharing of data. And in this, the Reproducibility Project have modeled the change they want to see, documenting every step of their project on a wiki.
3. Deep Traffic -- a road simulator where you must code the self-driving car neural network, which is part of Deep Learning for Self-Driving Cars at MIT. If you are officially registered for this class you need to perform better than 65 mph to get credit for this assignment.
4. rawgraphs.io -- an open source data visualization framework built with the goal of making the visual representation of complex data easy for everyone.

### TensorFlow to support Keras API

I found this interesting blog post by Rachel Thomas. My favorite quote:

Using TensorFlow makes me feel like I’m not smart enough to use TensorFlow; whereas using Keras makes me feel like neural networks are easier than I realized. This is because TensorFlow’s API is verbose and confusing, and because Keras has the most thoughtfully designed, expressive API I’ve ever experienced. I was too embarrassed to publicly criticize TensorFlow after my first few frustrating interactions with it. It felt so clunky and unnatural, but surely this was my failing. However, Keras and Theano confirm my suspicions that tensors and neural networks don’t have to be so painful.

### Reflecting on 2016 to Guide BigML’s Journey in 2017

2016 has proven a whirlwind year for BigML with substantial growth in users, customers and the team riding on the realization by businesses and experts that Machine Learning has transformational power in the new economy where data is in abundance but actionable insights have not been able to keep pace with improvements in storage, computational […]

### The state of Linux security

Lessons learned from 2016’s most important Linux security events.

## Introduction

In the last 10 years, GNU/Linux has achieved something some thought almost impossible: powering both the smallest and biggest devices in the world, and everything in between. Only the desktop remains unconquered terrain.

The year 2016 had an impact on the world, from both a real-life and a digital perspective. Some people found their personal details leaked on the internet; others found their software backdoored. Let’s look back at what happened last year in Linux security.

Continue reading The state of Linux security.

### Accelerating APIs with continuous delivery

Create business value and add new functionality through an automated build pipeline.

Much has been written about the web-based API economy, and there are clear benefits to an organization of exposing its services and offerings via an API. However, this goes deeper than simply reaching new consumers (and markets) and allowing “mashups” of functionality. A good public-facing API communicates its intent and usage in a much more succinct and effective way than any human-facing UI and accompanying user manual, and an API is typically easier to test, deliver, and operate at scale. But in order to ensure the continual delivery of value via an API, the process of build, validation, and quality assurance must be automated through a build pipeline.

## Putting an API through the pipeline

The first step in any attempt to create business value or add new functionality is to ensure that a hypothesis has been specified, a supporting experiment designed, and metrics of success defined. Once this is complete we like to work closely with the business stakeholders to specify the user journey of an API via tooling like Serenity BDD and rest-assured using an ‘outside-in’ approach (i.e., defining external functionality before working on the internal components).

Continue reading Accelerating APIs with continuous delivery.