SparkR, Query Optimization, xkcd, and ML in Financial Markets [DSR #96]

Jul 30, 2017

From Our Readers

Two great posts from Data Science Roundup readers this week! If you want to see yours here, email me.

What to Watch for when Moving from R to SparkR

Data Science Roundup reader Vicki Boykis has a great new post on the idiosyncrasies of SparkR’s dataframe. If you're a data scientist using R, SparkR is an incredibly powerful way to extend your existing skillset into the world of parallelized computing, but it’s important to understand what’s going on under the hood. Vicki’s article does a great job of showing exactly that.

Also, I’m embarrassed by just how much I enjoyed this joke from the article:

Some people, when confronted with a problem, think “I know, I’ll use multithreading”. Nothhw tpe yawrve o oblems.

It’s a good point: use Spark only when you have to.

veekaybee.github.io

A Jupyter Magic for Cell Completion Notifications

Reader Michelangelo D'Agostino built a cool Jupyter utility that notifies you on the completion of a long-running cell. Stop alt-tabbing back into your browser to see if your model has finished training!

github.com

This Week's Top Posts

How To Write Better Queries

SQL optimization is a critical topic, and one that too few data professionals go deep on. If you’re not familiar with how a query optimizer works, how to read an explain plan, or what linear time is, this article is a great getting-started guide. Share this post with your colleagues who are bogging down your Redshift cluster 😉

Note that much of this post was written in the context of a traditional relational engine like MySQL. The core concepts are very relevant for modern analytic databases, although the specific recommendations are somewhat less applicable.
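If you have never read an explain plan, here is a minimal sketch of what one looks like. It uses SQLite from Python's standard library rather than MySQL or Redshift, purely because it runs anywhere; the `orders` table, the `idx_orders_customer` index, and the data are made up for illustration, and the exact plan wording varies by SQLite version.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the human-readable detail of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the description of that plan step.
    return " | ".join(row[-1] for row in rows)


# Hypothetical table and data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(10_000)],
)

query = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index the engine has to scan every row (linear time).
before = plan(conn, query, (42,))

# With an index on the filter column it can search instead of scan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(conn, query, (42,))

print("before:", before)
print("after: ", after)
```

The "before" plan should describe a scan of `orders`, and the "after" plan should mention `idx_orders_customer`. The same habit carries over to other engines: run the plan, look for full scans, and check whether the filter columns are doing any work.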

www.datacamp.com

Nodebook | Stitch Fix

Nodebook is a fascinating extension to Jupyter Notebooks that makes it easier to treat notebooks like real code: it maintains state and relationships across cells, ensuring that regardless of the execution order, you always get coherent results. Usable today—check it out.

multithreaded.stitchfix.com

Why is LTV:CAC Still a Thing?

At this point in the development of the SaaS business model, the metrics one uses to evaluate a SaaS business are fairly well-known. This article is a wonderful new take on one of the core metrics in SaaS: the LTV (lifetime value) to CAC (customer acquisition cost) ratio.

If you work at a subscription-based business, this is a must-read.
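For anyone who has not computed the ratio by hand, here is a rough sketch of the arithmetic. It assumes one common simplified definition (LTV as margin-adjusted monthly revenue divided by monthly churn, CAC as sales and marketing spend divided by new customers), which is not necessarily the definition the article uses. The function names and every number below are made up.

```python
def ltv(monthly_revenue_per_customer, gross_margin, monthly_churn_rate):
    """Simplified lifetime value: margin-adjusted monthly revenue
    multiplied by the expected customer lifetime in months (1 / churn)."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return monthly_revenue_per_customer * gross_margin / monthly_churn_rate


def cac(sales_and_marketing_spend, new_customers):
    """Customer acquisition cost: total spend divided by customers won."""
    if new_customers <= 0:
        raise ValueError("new_customers must be positive")
    return sales_and_marketing_spend / new_customers


# Hypothetical inputs: $100 per month, 80% gross margin, 2% monthly churn.
customer_ltv = ltv(100.0, 0.8, 0.02)   # 100 * 0.8 / 0.02 = 4000

# Hypothetical inputs: $50,000 of spend that won 40 customers.
customer_cac = cac(50_000.0, 40)       # 50000 / 40 = 1250

ratio = customer_ltv / customer_cac    # 4000 / 1250 = 3.2
print(f"LTV:CAC = {ratio:.1f}")
```

Notice how sensitive the result is to the churn input: halving churn doubles LTV in this formula, which is one reason a single ratio can hide a lot.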

labs.openviewpartners.com

Every xkcd on Data Science

Sometimes it’s better to tell people they’re being stupid while making them laugh at the same time. There are tons of xkcd comics that illustrate the strange ways we can all become confused, and this post pulls together many of the best. Use them in your PowerPoints 😊

livefreeordichotomize.com

A SQL Connector for the Ethereum Blockchain

This is super freaking cool: a utility that allows you to connect Presto (an open source SQL engine) to the Ethereum blockchain. Complete installation instructions plus example queries.

While I don’t recommend plugging this into Mode Analytics and becoming a cryptocurrency day trader, there are so many fascinating and profitable things to be done with this… Want some ideas? Check this out.

github.com

OpenAI: Better Exploration with Parameter Noise

We’ve found that adding adaptive noise to the parameters of reinforcement learning algorithms frequently boosts performance.

Randomness helps neural networks get unstuck from local minima. The deeper we go into AI research, the more I find that I can take personal life lessons from the findings. Everyone needs some change, even your RNNs.
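To make "adaptive noise on the parameters" a little more concrete, here is a toy sketch. It is not OpenAI's implementation: the linear policy, the adaptation rule (grow the noise when the perturbed policy behaves too much like the original, shrink it otherwise), and all names and numbers are assumptions chosen for illustration.

```python
import random


def act(weights, state):
    """Deterministic linear policy: the action is the dot product."""
    return sum(w * s for w, s in zip(weights, state))


def perturb(weights, sigma, rng):
    """Parameter noise: add Gaussian noise to the weights, not the action."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def action_distance(weights, noisy_weights, states):
    """Root-mean-square gap between the two policies' actions on the same states."""
    squared = [(act(weights, s) - act(noisy_weights, s)) ** 2 for s in states]
    return (sum(squared) / len(squared)) ** 0.5


def adapt_sigma(sigma, distance, target, factor=1.01):
    """Assumed rule: more noise if behavior barely changed, less if it changed too much."""
    return sigma * factor if distance < target else sigma / factor


rng = random.Random(0)
weights = [0.5, -0.2, 0.1]  # hypothetical policy parameters
states = [[rng.uniform(-1, 1) for _ in weights] for _ in range(64)]

sigma, target = 0.5, 0.1    # starting noise scale and desired action-space gap
for _ in range(500):
    noisy = perturb(weights, sigma, rng)
    sigma = adapt_sigma(sigma, action_distance(weights, noisy, states), target)

print(f"adapted sigma: {sigma:.3f}")
```

The point of adapting in action space is that the same amount of weight noise can change behavior a lot or barely at all depending on the network, so the noise scale is tuned by its effect rather than fixed up front.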

blog.openai.com

Impact Of Artificial Intelligence And Machine Learning on Trading And Investing

If you don’t care about finance and investing, skip this. I found it fascinating. Here are a couple of quotes:

…broad acceptance of [AI / ML] is slow due to various factors, the most important being that AI requires investment in new tools and human talent. The majority of funds use fundamental analysis because this is what managers learn in their MBA programs. There are not many hedge funds that rely solely on AI.

…I believe the transition for most traders will not be possible. The combination of skills required for understanding and applying AI rules out 95% of traders used to drawing lines on charts and watching moving averages.

In short: there is plenty of work to be done (and money to be made) in bringing data science to financial markets.

medium.com

Data viz of the week

From 34,476 DC and Marvel characters.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.

fishtownanalytics.com

Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.

www.stitchdata.com

By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123
