Talk Summary: Building Great Data Products

Yesterday I went to a great talk by DJ Patil called "Building Great Data Products." DJ has an impressive background in building data-centric products: he was head of LinkedIn's data products for several years, then Data Scientist in Residence at Greylock, and is now the VP of Product at RelateIQ, a CRM tool whose homepage reads, "The Beginning of Data Science in Decision Making." He also co-coined the term Data Scientist.

DJ discussed some of the lessons he learned while building products at LinkedIn and RelateIQ, and the following is a summary of my notes from the talk:

DJ's talk was great, and I can vouch for many of these lessons personally based on my work at Google and Factual. I think the most valuable lesson that I've learned over the last decade is one that came up repeatedly during the talk: simple approaches are often surprisingly effective. For example, I remember one task where I had a sparse dataset and had to fill in as much of the missing data as possible. Instead of using fancy algorithms and sophisticated machine learning, I tried the following heuristic: for every pair of columns, if a value, X, in one column was associated with a value, Y, in another column almost all of the time, then every time the first column value was X and the second column value was missing, I'd set the second column value to Y. It was a very naive approach, and yet it managed to fill in a large chunk of my dataset. I've now used this heuristic for data about books, movies, places of interest, and other datasets, and it often makes more clever strategies unnecessary or not worth the time.
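
Here is a minimal sketch of that column-pair heuristic in plain Python. It is an illustration under assumptions, not the code I originally ran: the function name `fill_missing`, the use of `None` to mark a missing value, and the two thresholds (`min_confidence` for "almost all of the time" and `min_support` for how many examples a rule needs) are all made up for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fill_missing(rows, min_confidence=0.95, min_support=3):
    """Fill missing values using simple "X in one column implies Y in another" rules.

    rows: list of dicts sharing the same keys; None marks a missing value.
    min_confidence and min_support are assumed thresholds, chosen for illustration.
    Returns a new list of dicts; the input is left unchanged.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    filled = [dict(row) for row in rows]

    # Look at every ordered pair of columns (source column -> target column).
    for src, dst in permutations(columns, 2):
        # Count co-occurrences using only rows where both values were originally present.
        counts = defaultdict(Counter)
        for row in rows:
            if row[src] is not None and row[dst] is not None:
                counts[row[src]][row[dst]] += 1

        # Keep a rule X -> Y only if Y follows X almost all of the time.
        rules = {}
        for x, ys in counts.items():
            y, hits = ys.most_common(1)[0]
            total = sum(ys.values())
            if total >= min_support and hits / total >= min_confidence:
                rules[x] = y

        # Apply the rules where the target value is still missing.
        for original, result in zip(rows, filled):
            if result[dst] is None and original[src] in rules:
                result[dst] = rules[original[src]]

    return filled
```

Rules are learned only from the original values, so a guess made for one column pair never feeds into the rule for another pair. With three rows of `{"city": "Paris", "country": "France"}` and a fourth row whose country is `None`, the fourth row gets "France"; a city seen with conflicting countries, or too few times, is left alone.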

Another important lesson that I've learned is that it's a great idea to work with small samples of data and use a single machine for as long as possible. Hadoop and distributed systems are nice when you're running in production on terabytes of data, but they greatly diminish your development speed and ability to experiment. You'll probably make progress much more rapidly if you just load a 500MB slice of data into RAM and experiment on your laptop.
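
One simple way to get such a slice is to take a uniform random sample of records in a single pass over the full file. The sketch below uses reservoir sampling (Algorithm R) with only the standard library; the function name `sample_lines` and its parameters are hypothetical, and taking the first N lines or a fixed date range may work just as well for many datasets.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Uses reservoir sampling, so the input is read once and only k items
    are held in memory at any time.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample
```

For a large text file, you could pass the open file handle directly, for example `sample_lines(open_file, 100_000)`, and then experiment on the resulting list in memory.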

After DJ's talk, I started thinking about how many of the blog posts I've seen focus on technologies that are commonly used for working with data: Hadoop, scipy, regular expressions, etc. I'd love to see more blog posts (and books) about higher-level strategies and tactics for building data products, with suggestions like "work with small samples," "leverage Mechanical Turk," and "start with the simplest approaches." I might turn a few of these topics into future blog posts, but I'm sure there are many lessons that I haven't learned yet. If you know of any great resources for creating data products, please mention them in the comment section!

Tags: Notes, Talk Notes

Copyright © 2013-2016 Leo Polovets