Arrow Flight. Polynote. 2019 Coding Salaries. Glue Work in Analytics. The Importance of Data Analysts. [DSR #203]
This week's best data science articles
Introducing Apache Arrow Flight: A Framework for Fast Data Transport
This post introduces Arrow Flight, a framework for building high performance data services. We have been building Flight over the last 18 months and are looking for developers and users to get involved.
I’ve been interested in Apache Arrow for years: Wes McKinney’s next project after Pandas has a tremendous amount of promise to move the entire data processing landscape forwards. This post announces their latest release: Arrow Flight is a mechanism to move data quickly.
This sounds somewhat boring relative to the things that data analysts and scientists more commonly think about. Why does this matter?
The speed of data transport is a bit like a “law of physics” for data processing: this speed determines how all downstream data applications are built. For example, slow data transport is one of the primary reasons that industry is currently moving towards ingesting all organizational data into a single warehouse/lake and doing all processing from there.
With Arrow (and Arrow Flight), though, this “data locality” constraint begins to relax. All of a sudden, you can think about building applications differently. Dremio is a product that’s been built for this paradigm (and uses Arrow under the hood). It enables data processing against non-local and heterogeneous data stores at speed. The article itself is fantastic if you’d like to dig into how the technology actually works.
Keep an eye on this space—I really think this is one of the most interesting projects in all of data right now.
Coding Salaries in 2019: Updating the Stack Overflow Salary Calculator
DevOps has jumped out in front of Data Science and Data Engineering as the most highly paid software engineering role! This is not at all surprising given the tremendous influx of junior talent into data science, and the commodification of the low end of data engineering by off-the-shelf tooling.
Netflix Polynote: an IDE-inspired Polyglot Notebook
We are pleased to announce the open-source launch of Polynote: a new, polyglot notebook with first-class Scala support, Apache Spark integration, multi-language interoperability including Scala, Python, and SQL, as-you-type autocomplete, and more.
👀
I’ve talked about glue work in this newsletter before. It’s an incredibly important concept on technical teams of all stripes, whether you’re a manager or an IC. If you haven’t watched the talk linked above, you should.
This post by Caitlin Moorman talks about glue work within the context of an analytics team. What does it look like? How should it be recognized? What should you do if your glue work isn’t being recognized? And more.
What’s so fantastic about this article is that Caitlin points out that, in some ways, all analytics is glue work. Typically, the modern data team doesn’t own any revenue-focused KPIs and is instead heavily involved in helping the other teams in the business hit theirs. This makes the question of how to value the work of data analysts, especially the work we do producing anything other than lines of code, particularly hard (and particularly important!).
www.locallyoptimistic.com
Data Science’s Most Misunderstood Hero
When in doubt, hire analysts before other roles. Appreciate them and reward them. Encourage them to grow to the heights of their chosen career (and not someone else’s). Of the cast of characters mentioned in this story, the only ones every business with data needs are decision-makers and analysts.
This is well-trodden territory for me (and likely for you), but this article by Cassie Kozyrkov does a good job of exploring why data analysts are unbelievably important, why they’re distinct from other data roles, why they have their own career paths, and why they should be hired first.
towardsdatascience.com
Buffet lines are terrible, but let's try to improve them using computer simulations
Yay, a queueing simulation problem! I think operations research is surprisingly fun, and the linear optimization problems it involves are a totally different class of problems than most modern data scientists find themselves thinking about. This is one of the cooler posts I’ve seen on the topic, including some great simulations, from the inimitable Erik Bernhardsson.
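If you haven't played with this kind of problem before, here is a minimal sketch of a single-file line simulation. To be clear, this is not the model from Erik's post: the function name `simulate_line`, the one-server setup, and the arrival and service rates are all assumptions picked purely for illustration.

```python
import random


def simulate_line(arrival_times, service_times):
    """Return each guest's wait before being served (one server, first come first served)."""
    waits = []
    server_free_at = 0.0  # time at which the buffet station next becomes free
    for arrive, service in zip(arrival_times, service_times):
        start = max(arrive, server_free_at)  # wait if someone is still serving themselves
        waits.append(start - arrive)
        server_free_at = start + service
    return waits


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so reruns give the same numbers
    n_guests = 1000

    # Assumed rates: a guest arrives about once a minute on average,
    # and takes about 0.8 minutes on average to fill a plate.
    arrivals, t = [], 0.0
    for _ in range(n_guests):
        t += rng.expovariate(1.0)
        arrivals.append(t)
    services = [rng.expovariate(1 / 0.8) for _ in range(n_guests)]

    waits = simulate_line(arrivals, services)
    print(f"average wait: {sum(waits) / len(waits):.2f} minutes")
```

Try nudging the average service time closer to the average gap between arrivals and rerunning it to see how the waits respond.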
TL;DR: Recently, DuckDB, a database that promises to become the SQLite-of-analytics, was released and I took it for an initial test drive.
Very cool. Great overview post.
On Supporting Efficient Snapshot Isolation for Hybrid Workloads with Multi-Versioned Indexes
Dig in at your own risk: this is fairly dense. That said, it’s part of my obsession with low-latency analytical systems (more in issue #197). I think that such systems have the potential to significantly change how businesses operate by plugging analytical systems directly into operational systems to automate actions. The current “modern” tech stack has a latency SLA that is too high for the operational use case, but there continue to be threads getting pulled that could get us closer.
Thanks to our sponsors!
dbt: Your Entire Analytics Engineering Workflow
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Stitch: Simple, Powerful ETL Built for Developers
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.