Internet Trends. Data Warehouse SLAs. Training Data. [DSR #138]
Editorial Perspective.
The Data Science Roundup was started in September of 2015, coming up on three years ago(!). Over that period I’ve solidified what I like to cover, so I want to take just a second to share exactly what criteria I use to include or exclude a post. My goal is to make it clear exactly what you should expect in your inbox every Sunday morning.
I have a broad definition of “data science”. I care about, and cover, everything from data engineering to business intelligence to statistics to ML & AI to visualization. If you’re a data scientist, you should probably care about all of these things too.
I value applications over theory. Both are valuable, but I want this newsletter to first and foremost be useful. I favor content that showcases industry experts relaying their experiences solving real-world problems.
I only link to something once. Most of you are long-time, loyal readers. You don’t need yet another post explaining deep learning or Bayes’ Law—by now, I’ve covered those basics (feel free to revisit in the archives). Today, when I link to a post, it’s because that post brings something brand new.
Nothing is too advanced or too basic. Important ideas are sometimes complex and sometimes simple.
Every week I scan the headlines of 300-500 posts to get down to the 5-8 that I include here, and those are the rules I use to choose them. I hope you like the end product.
If you enjoy getting the Roundup every week, my only ask is that you spread the word by forwarding this email to three friends. The Roundup grows through your recommendations! As always, thank you: it is a privilege to have you as a reader. 🙏
- Tristan
The Week's Most Useful Posts
Mary Meeker's 2018 Internet Trends
Data scientists are specialists. Because of that, there is a tendency among those in the field to go ever-deeper into existing areas of expertise rather than going wide. This is suboptimal, in my opinion: data scientists are ultimately problem-solvers, and most solutions to hard problems are combinatorial, involving insights from many domains.
This slide deck is the authoritative source on the “state of the internet”. Every year, it is read throughout tech, from VCs to startup founders to CEOs of F500 companies. Its ~300 slides contain lots of primary research on where the internet stands today. It provides the backdrop and context for many of the problems you are trying to solve.
Zooming out is important.
Should Your Data Warehouse Have an SLA?
Yes, if you want to build a truly data-driven organization your data warehouse needs a Service Level Agreement (SLA). At the core of any data driven organization is trust - your stakeholders must trust that when they need data, it will be there and it will be accurate. Without trust in the data warehouse, your organization will be less likely to use data to drive decisions big and small.
Could not agree more, and “SLA” just isn’t in most analysts’ vocabularies today.
If you’re a close reader, you’ll notice that I’ve started feverishly linking to almost every post by the brand-new blog Locally Optimistic. The blog is a collaborative effort between several prominent folks in the NY data scene, and it’s one of the very few places I’ve found where smart people are saying smart things about building data teams. I believe that this is the hardest part of solving data problems today, so this blog couldn’t be more timely.
I definitely recommend following the blog directly, or you can just stay tuned here, as I’m sure I’ll continue to link to many of their posts.
www.locallyoptimistic.com
Advice For Applying To Data Science Jobs
This post is a collection of my thoughts and recommendations for people interested in applying to data science jobs in the US.
Exhaustive, excellent.
Why You Need to Improve Your Training Data, and How to Do It
As part of my job I work closely with a lot of researchers and product teams, and my belief in the power of data improvements comes from the massive gains I’ve seen them achieve when they concentrate on that side of their model building. The biggest barrier to using deep learning in most applications is getting high enough accuracy in the real world, and improving the training set is the fastest route I’ve seen to accuracy improvements.
The recommendation itself is not new or surprising—data over algorithms has become conventional wisdom in the era of deep learning—but the stories the author tells in making the point are really excellent, as are the recommendations. The author knows his subject cold; this is the best post I’ve seen on this particular topic.
Statistics for People in a Hurry
This post isn’t for you, it’s for the people you work with! Keep it in your back pocket (or your browser favorites) to send to people who ask fundamental stats questions. It hits the core concepts effectively while avoiding getting into the weeds.
Ever wished someone would just tell you what the point of statistics is and what the jargon means in plain English? Let me try to grant that wish for you! I’ll zoom through all the biggest ideas in statistics in 8 minutes!
towardsdatascience.com
Data viz of the week
Gorgeous data viz / art / photography by Marcus Lyon, brilliantly summarized by Nathan Yau:
Artist Marcus Lyon imagines worlds where there are so many people that the only thing left to do is to make gigantic places to fit everyone. The patterns repeat themselves over and over, and it’s no longer about the individual exploring an entire place.
There are many additional images at the link and all are excellent. It’s impossible to visualize these huge places without the top-down view that satellite imagery provides.
Thanks to our sponsors!
Fishtown Analytics: Analytics Consulting for Startups
At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.
Stitch: Simple, Powerful ETL Built for Developers
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123