Real-time Materialized Views. ML Systems Design. Mastering the Phone Screen. BERT @ Google. [DSR #210]
This week's best data science articles
The Materialize Incremental View Maintenance Engine
…
This is definitely the most exciting thing I’ve come across in the past little while. I spoke to the CEO this week and got the download on it, and have also watched the talk from the link above, given by their Chief Scientist.
Materialize is a streaming SQL materialized view product. It competes directly with KSQL, but has some important technical and UX advantages over it. It’s all based on long-running academic research (since 2013) and an open-source project called Timely Dataflow.
IMO, the main benefit vs. something like KSQL is that it handles joins seamlessly. It can handle the entire set of TPC-H queries, including 8-way joins, and can keep multiple nested layers of materialized views up to date within milliseconds. Its data is stored in a columnar format under the hood, so it’s optimized for analytic workloads.
While the research has been ongoing for a while, the commercial product is only just now being released. Beta is Jan/Feb. If you have use cases for streaming SQL processing, I’d highly recommend watching this full talk and staying in touch with Materialize as the launch happens.
We plan on experimenting with building a dbt adapter as soon as we can get our hands on an early release. If Materialize can deliver on the demo shown in this talk, it unlocks a bunch more use cases for data pipelines.
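For anyone new to incremental view maintenance, here is a minimal Python sketch of the general idea: rather than rescanning the whole table whenever a row changes, the view applies each insert or delete as a small delta. This is a toy illustration only, not how Materialize or Timely Dataflow are implemented, and it covers a single grouped sum rather than the multi-way joins mentioned above. The names `IncrementalSumView`, `recompute`, and the sample rows are all made up for the example.

```python
from collections import defaultdict


def recompute(rows):
    """Baseline: rebuild SUM(amount) GROUP BY region by scanning every row."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


class IncrementalSumView:
    """Hypothetical view that is updated from changes instead of rescanned."""

    def __init__(self):
        self.totals = defaultdict(int)  # region -> running sum
        self.counts = defaultdict(int)  # region -> number of live rows

    def apply(self, region, amount, diff):
        """Apply one change: diff=+1 for an insert, diff=-1 for a delete."""
        self.totals[region] += diff * amount
        self.counts[region] += diff
        if self.counts[region] == 0:
            # No rows left for this group, so it leaves the view.
            del self.totals[region]
            del self.counts[region]

    def snapshot(self):
        return dict(self.totals)


# Illustrative data: (region, amount) pairs.
rows = [("east", 10), ("east", 5), ("west", 7)]
view = IncrementalSumView()
for region, amount in rows:
    view.apply(region, amount, +1)

# Deleting a row touches one group rather than the whole table.
view.apply("east", 10, -1)
rows.remove(("east", 10))
```

After the delete, `view.snapshot()` and `recompute(rows)` agree, but the incremental path did work proportional to the change rather than to the size of the table.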
Machine Learning Systems Design
Chip Huyen put together a fantastic resource for anyone studying for ML interviews:
This part contains 27 open-ended questions that test your ability to put together what you’ve learned to design systems to solve practical problems. Interviewers give you a problem, possibly related to their products, and ask you to design a machine learning system to solve it. This type of question has become so popular that it’s almost guaranteed that you’ll be asked at least one during your interview process. In an hour-long interview, you might have time to go over only one or two questions.
It’s also one of the most comprehensive overviews of the topic I’ve ever seen. There are dozens of links, each of which is to a foundational piece of writing in the field.
I have been conducting a lot of phone screenings for Data Scientist roles lately, and I can only speak to my own experience and opinions, but I can give a few tips. (Note, these are not "universal" tips that will work for every interviewer, but things I noted)
While we’re on the topic of interviewing, this tweetstorm from Renee Teate (32 tweets long, click through for the whole thing!) is a great resource for anyone preparing for technical phone screens. She’s done a lot of them and highlights what you should be focused on at this stage in the process. 💯
Understanding searches better than ever before
Google doesn’t often write about its work on the actual Search product; because of its inherently adversarial relationship with the SEO industry, much of that work is pretty opaque. But Pandu Nayak, Google’s VP of Search, recently posted an update that Search is now using BERT to respond to about 10% of all US-English queries. What I found most interesting were the examples where it outperforms prior results; it’s night-and-day better for a certain class of questions.
Here’s the summary:
Search is not a solved problem
No matter what you’re looking for, or what language you speak, we hope you’re able to let go of some of your keyword-ese and search in a way that feels natural for you. But you’ll still stump Google from time to time. Even with BERT, we don’t always get it right. If you search for “what state is south of Nebraska,” BERT’s best guess is a community called “South Nebraska.”
If Google is only now incorporating BERT into its flagship product, we are certainly a long way from seeing the commercialization of many of the stunning research wins that have happened in NLP over the past 12-24 months. Lots more to come.
Beginners Guide to Columnar File Formats
File formats can be confusing, so let’s delve into columnar file formats (like Parquet) and explain why they’re different to regular formats (like CSV, JSON, or Avro).
I love this post because it almost exactly mirrors a talk that I give during the Fishtown Analytics onboarding process, but with better examples! Understanding columnar file formats is critical to understanding the internals of modern analytic SQL engines and therefore critical to using them well.
If you’re not familiar with this topic, this is a must-read. If you already know this stuff cold, save this as a resource for future people who join your team.
blog.matthewrathbone.com
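If you want a quick feel for the row vs. column distinction before reading the post, here is a small Python sketch. It only models the layout in memory; it is not the Parquet format or any real file reader, and the sample rows and the helper names `to_columns` and `run_length_encode` are invented for illustration. It assumes every row has the same fields.

```python
from itertools import groupby

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "country": "US", "revenue": 120},
    {"order_id": 2, "country": "US", "revenue": 80},
    {"order_id": 3, "country": "CA", "revenue": 200},
]


def to_columns(rows):
    """Pivot records into one list per field (assumes identical fields)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Row layout: summing revenue means visiting every whole record.
total_from_rows = sum(row["revenue"] for row in rows)

# Column layout: the same sum reads a single list and ignores the rest.
total_from_columns = sum(columns["revenue"])

# Values of one type stored side by side tend to repeat, so they compress well.
encoded_country = run_length_encode(columns["country"])
```

Both totals come out the same, but the columnar version never touches `order_id` or `country`. An analytic query that reads a few columns out of a wide table gets the same saving.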
Thanks to our sponsors!
dbt: Your Entire Analytics Engineering Workflow
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Stitch: Simple, Powerful ETL Built for Developers
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123