9:00 Xavier Amatriain - Netflix machine learning
Background: Research/Engineering Director at Netflix
Why didn’t they use the Netflix Prize-winning model? Its complexity created engineering challenges that didn’t justify the business benefits. You can have a highly accurate model, but if it requires too many moving parts, it’s not going to be worth it. Businesses prefer models that are simpler to deploy and are willing to trade off some accuracy for ease of use.
Netflix data science team is focused on 4 major problems: Ranking, Similarity, Row selection, Search recommendations.
Ranking is the most important topic at Netflix because it is the key component used in search, advertising, and recommendations.
The “learning to rank” problem is defined as: can you construct a model from training data that ranks elements in a defined order?
Three main approaches to the learning-to-rank problem:
Pointwise: the loss function is defined on individual relevance judgments. Examples: ordinal regression, logistic regression, SVM, GBDT.
Pairwise: the loss is defined on pairs of items. Examples: RankSVM, RankBoost, RankNet, FRank.
Listwise: the loss is defined over an entire ranked list.
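The pairwise approach above can be sketched as a minimal RankSVM-style trainer. This is a toy illustration with made-up data and a plain hinge-loss update, not Netflix’s production system: learn weights w so that score(x_i) > score(x_j) whenever item i is judged more relevant than item j.

```python
import numpy as np

def train_pairwise(pairs, dim, lr=0.1, epochs=100):
    """pairs: list of (x_pos, x_neg) feature-vector pairs where x_pos
    should be ranked above x_neg (RankSVM-style hinge loss)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            margin = w @ (x_pos - x_neg)
            if margin < 1.0:  # penalize mis-ordered or low-margin pairs
                w += lr * (x_pos - x_neg)
    return w

# Toy data: the first feature carries the relevance signal
pairs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0])),
         (np.array([0.9, 0.1]), np.array([0.2, 0.8]))]
w = train_pairwise(pairs, dim=2)
print(w @ np.array([1.0, 0.0]) > w @ np.array([0.0, 1.0]))  # True
```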
Similarity is important; graph-based similarity is an interesting approach:
SimRank (Jeh & Widom)
Two objects are similar if they are referenced by similar objects
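The SimRank intuition above can be sketched with the naive iterative computation from the Jeh & Widom paper. The tiny graph here is illustrative only; real systems use heavily optimized approximations.

```python
import numpy as np

def simrank(in_neighbors, n, C=0.8, iters=10):
    """Naive SimRank: two nodes are similar if their in-neighbors
    are similar, decayed by constant C."""
    S = np.eye(n)
    for _ in range(iters):
        new_S = np.eye(n)
        for a in range(n):
            for b in range(n):
                if a == b:
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:
                    continue  # no in-neighbors => similarity stays 0
                total = sum(S[i][j] for i in Ia for j in Ib)
                new_S[a][b] = C * total / (len(Ia) * len(Ib))
        S = new_S
    return S

# Tiny toy graph: nodes 1 and 2 are both referenced by node 0
in_neighbors = {0: [], 1: [0], 2: [0]}
S = simrank(in_neighbors, 3)
print(S[1, 2])  # 0.8: maximally similar, up to the decay C
```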
Row selection (clustering):
How do you find what group of movies to recommend to users?
How do you avoid duplicate rows?
Search recommendations:
Play after search: transition play after query
Play related to search: Similarity between users (query, play)
Linkedin: Metaphor, a system for related search recommendation
Q: Do these systems modify preferences over time?
A: We’ve looked into this problem and have tried various approaches to combat this tendency. “Random picks” is one; using multi-armed bandit models is another.
10:30 Quoc Le - Deep learning at Google
Background: Google engineer / Stanford PhD
Deep learning’s main advantage is that it requires less domain knowledge: features are learned automatically, so there’s no need to engineer them by hand.
What is deep learning? Non-linearity is the key, it allows you to represent very complex functions that you can use to fit to your data.
At Google, we developed DistBelief to train DNNs on many machines.
Goal: train DNNs on many machines
Model is a multi-layered architecture
Forward pass to compute features
Backward pass to compute gradients
How is computation distributed?
Model partitioning:
- every model computation is split across multiple machines
- Parameter server
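The parameter-server pattern just mentioned can be sketched as follows. This is a single-process toy illustration of the idea, not the actual DistBelief implementation: model replicas pull the current parameters, compute a gradient on their own shard of data, and push the update back asynchronously.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def pull(self):
        return self.w.copy()   # replica gets a (possibly stale) snapshot

    def push(self, grad, lr=0.1):
        self.w -= lr * grad    # apply an update whenever it arrives

def replica_step(server, X, y):
    """One model replica: pull weights, compute a gradient on its
    data shard (squared loss here), push the gradient back."""
    w = server.pull()
    grad = X.T @ (X @ w - y) / len(y)
    server.push(grad)

# Two "replicas", each with its own data shard, sharing one server
ps = ParameterServer(dim=2)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0])   # true weights to recover
for _ in range(200):
    replica_step(ps, X[:50], y[:50])
    replica_step(ps, X[50:], y[50:])
print(ps.w)  # close to [1.0, -2.0]
```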
Applications of deep learning currently used at Google:
voice search
photo search
text understanding
Voice search:
Extract a 1-second speech frame, send it through a NN to classify phonemes
Hidden layer with 1000s of nodes
Classification top layer
Photo search:
1.5 Bn parameters in NN
Example categories
Archery
Seat belt
Shredder
Boston rocker
Also works on cartoon images
Text understanding:
Start by understanding the “meaning” of a word: does “apple” refer to the company or the fruit?
Every word is mapped into a 100-dimensional space
Read text; the input is the previous N words
Relation extraction
Mikolov, Sutskever, Le: Learning the meaning behind words
Use it for machine translation
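The “words as vectors” idea above can be illustrated with cosine similarity between toy embeddings. The 3-D vectors and the word senses below are hypothetical stand-ins for real 100-D learned embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy embeddings (the real model learns these from text)
emb = {
    "apple_fruit":   np.array([0.9, 0.1, 0.0]),
    "apple_company": np.array([0.1, 0.9, 0.1]),
    "banana":        np.array([0.8, 0.2, 0.1]),
    "microsoft":     np.array([0.0, 0.8, 0.2]),
}

# The fruit sense sits near other fruits, the company sense near other companies
print(cosine(emb["apple_fruit"], emb["banana"]))        # high
print(cosine(emb["apple_company"], emb["microsoft"]))   # high
print(cosine(emb["apple_fruit"], emb["microsoft"]))     # low
```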
Q: Is your softmax layer in your parameter server?
A: No, it’s distributed with the model
Q: For asynchronous SGD, what are the theoretical guarantees?
A: It turns out it works great. We embarked on this by first relaxing the requirement that all distributed model replicas have completely up-to-date weights; after finding that it worked so well, we looked for papers that could give us theoretical bounds, and we found them. The key metric that determines how well your models converge is the delay between the parameter server and the distributed model replicas.
11:00 Joseph Gonzalez - understanding graph data
Background: Co-founder at GraphLab
A large portion of the data available on the web today is structured as graphs. Applications of graph processing:
determine the cohesiveness of communities - finding “triangles”
recommending products
identify leaders in communities using pagerank on follower structures
topic modeling
MapReduce is not the best computational primitive for graphs; the computation models for graphs are different.
Graph-parallel abstraction:
User defines a program that lives on graph vertices
Programs can only interact with neighboring edges and vertices
Programs must signal neighbors to update
Time is represented as epochs, or iterations through the graph
Power-law degree distribution: a large number of low-degree vertices and a small number of high-degree vertices
High-degree vertices present challenges in scaling graph algorithms; they become single points of high communication overhead.
The solution proposed by GraphLab 2 is to split high-degree vertices across multiple machines, by splitting the vertex program into three phases:
Gather
Apply (update)
Scatter (trigger neighbors)
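The Gather/Apply/Scatter decomposition above can be sketched with PageRank on a toy graph. This is a single-machine illustration of the pattern, not GraphLab’s distributed implementation; here Scatter is collapsed into a plain loop over iterations.

```python
def pagerank_gas(out_edges, n, d=0.85, iters=50):
    """PageRank written in the Gather/Apply style: each vertex gathers
    rank contributions from its in-neighbors, then applies an update."""
    in_edges = {v: [] for v in range(n)}
    for u, vs in out_edges.items():
        for v in vs:
            in_edges[v].append(u)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = []
        for v in range(n):
            # Gather: sum contributions from in-neighbors
            total = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
            # Apply: update this vertex's value
            new_rank.append((1 - d) / n + d * total)
        rank = new_rank
        # Scatter would signal out-neighbors to rerun; here we just iterate
    return rank

# Toy graph: vertex 2 receives links from both 0 and 1
out_edges = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank_gas(out_edges, 3)
print(max(range(3), key=lambda v: ranks[v]))  # 2: the most-linked-to vertex
```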
2:00 Scott Triglia - Building a data product at Yelp
Background: Yelp Engineer
Goals of the talk:
Show our thought process when building recommendations
Expose some useful questions
Practical solutions for new teams
Yelp overview:
108 Million monthly visitors, 42 million reviews
half of searches come from mobile devices
Main topic:
Recommendation service for nearby places
Meant to replace the unpersonalized nearby search systems
Decision time:
Gather a team: 3 backend engineers, 3 frontend engineers
First step is to figure out what problem is a recommendation system trying to solve?
First question … who else is solving a similar problem? Netflix!
Second question … what’s different from our scenario vs Netflix? The context of location and time is very important.
Organizational context matters too:
very little infrastructure support for large-scale ML
must scale to all Yelp data on day 1
team was small
this is a first version of a hopefully long lived product
Guiding principles:
Fast response time for pruning
Beating a benchmark was not the main goal, wanted to build a “good product”
System architecture:
API Request -> “Experts” nodes -> search
The experts metaphor allows you to modularize
Each expert ranks their own results
Observation:
Experts don’t have to be very prolific to be valuable.
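The experts architecture above can be sketched as follows. The expert names, fields, and merge rule are illustrative assumptions, not Yelp’s actual code; the point is the modularity: each expert independently proposes and ranks its own candidates, and the service merges the lists.

```python
def nearby_expert(user, businesses):
    # This expert ranks by distance (closer is better)
    return sorted(businesses, key=lambda b: b["distance_km"])[:2]

def category_expert(user, businesses):
    # This expert ranks by rating within the user's liked categories
    liked = [b for b in businesses if b["category"] in user["likes"]]
    return sorted(liked, key=lambda b: -b["rating"])[:2]

def recommend(user, businesses, experts):
    """Merge per-expert ranked lists, de-duplicating across experts."""
    seen, merged = set(), []
    for expert in experts:
        for b in expert(user, businesses):  # each expert ranks its own results
            if b["name"] not in seen:
                seen.add(b["name"])
                merged.append(b)
    return merged

user = {"likes": {"coffee"}}
businesses = [
    {"name": "Cafe A", "category": "coffee", "rating": 4.5, "distance_km": 2.0},
    {"name": "Deli B", "category": "deli",   "rating": 4.0, "distance_km": 0.5},
    {"name": "Cafe C", "category": "coffee", "rating": 3.5, "distance_km": 5.0},
]
recs = recommend(user, businesses, [nearby_expert, category_expert])
print([b["name"] for b in recs])  # ['Deli B', 'Cafe A', 'Cafe C']
```

Note that adding a new signal is just adding a new expert function, which is the modularity win the talk describes.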
Tips:
Solve your own problem not someone else’s
Being cutting-edge may not be the top priority
Build for the tools you have.
Get your end-to-end systems working first!
Good software engineering enables quality ML
Q: What was the release process like?
A: We did iterative internal releases to try to answer all the questions of how the system was working. Being able to answer “why” a recommendation was made was important.
Q: What were key metrics?
A: Nearby page view times
2:30 Jake Mannix - Content based recommender systems
Background: User modeling at Twitter; former LinkedIn data science team
Recommender systems:
- “Groups you may like”
- “Article you may be interested in”
- “Ads you may like”
Aside: “users” don’t have to be people. LinkedIn wanted to build a generalized recommender framework; we can view the following scenarios as [user, item to be recommended] pairs:
- TalentMatch [user=job posting, item=user-profile]
- Jobs for your group [user, group]
- Jobs for your group [group, job]
- Ads you may like [user, ad]
Collaborative filtering is generic!
- Users/items are reduced to UUIDs; they could be anything
What about domain specific knowledge?
- Users are more than just account names
Features!
each user has a feature vector
each item has a feature vector
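The feature-vector view above makes content-based scoring generic: score any [user, item] pair by comparing their vectors, regardless of whether the “item” is a job, a group, or an ad. The feature names and values below are hypothetical, for illustration only:

```python
import numpy as np

def score(user_vec, item_vec):
    """Content-based match score: dot product of feature vectors."""
    return user_vec @ item_vec

# Hypothetical features: [likes_engineering, likes_design, seniority]
user = np.array([1.0, 0.2, 0.5])
jobs = {
    "senior engineer": np.array([1.0, 0.0, 0.9]),
    "junior designer": np.array([0.1, 1.0, 0.1]),
}
best = max(jobs, key=lambda j: score(user, jobs[j]))
print(best)  # senior engineer
```

The same scoring function works for any of the [user, item] pairings listed above, which is the point of the generalized framework.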