2013-11-15

9:00 Xavier Amatriain - Netflix machine learning

Background: Research/Engineering Director at Netflix

Why didn’t they use the Netflix Prize-winning solution? The model’s complexity created engineering challenges that didn’t justify the business benefits. You can have a highly accurate model, but if it requires too many moving parts it isn’t worth it. Businesses prefer models that are simpler to deploy and are willing to trade off some accuracy for ease of use.

The Netflix data science team is focused on 4 major problems: ranking, similarity, row selection, and search recommendations.

Ranking is the most important topic within Netflix because it is the key component used in search, advertising and recommendations.

The “learning to rank” problem: can you construct a model from training data that ranks elements in a defined order?

Three main approaches to the learning-to-rank problem:

Pointwise: the loss function is defined on individual relevance judgements. Ordinal regression, logistic regression, SVM, GBDT.

Pairwise: RankSVM, RankBoost, RankNet, FRank

Listwise
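To make the pairwise idea concrete, here is a minimal sketch in the spirit of RankSVM: a linear scorer is trained so that, for each (more-relevant, less-relevant) pair, the relevant item scores higher by a margin. The toy data and hinge-style update are illustrative, not anything Netflix described.

```python
# Toy pairwise learning-to-rank sketch (RankSVM-style hinge updates).
# Features and pairs below are made up for illustration.

def score(w, x):
    """Linear relevance score: w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, dim, lr=0.1, epochs=50):
    """pairs: list of (x_pos, x_neg) where x_pos should outrank x_neg."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            margin = score(w, x_pos) - score(w, x_neg)
            if margin < 1.0:  # hinge: only update on violated pairs
                for i in range(dim):
                    w[i] += lr * (x_pos[i] - x_neg[i])
    return w

# Toy features per item: [popularity, personal affinity]
pairs = [([0.9, 0.8], [0.2, 0.1]),
         ([0.7, 0.9], [0.6, 0.2])]
w = train_pairwise(pairs, dim=2)
```

After training, the preferred item in each pair scores higher, which is exactly the property a pairwise loss optimizes (unlike pointwise methods, which regress each item’s relevance independently).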

Similarity is important; graph-based similarity is an interesting approach:

SimRank (Jeh & Widom)

Two objects are similar if they are referenced by similar objects.
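A tiny sketch of the SimRank recursion may help: similarity of two nodes is an average of the similarity of their in-neighbors, scaled by a decay constant. The graph and constants here are illustrative, not from the talk.

```python
# Toy SimRank (Jeh & Widom): s(a,b) = C / (|I(a)||I(b)|) * sum s(i,j)
# over in-neighbor pairs, with s(a,a) = 1. Graph below is made up.

def simrank(in_neighbors, C=0.8, iters=10):
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:
                    new[(a, b)] = 0.0  # no evidence of similarity
                    continue
                total = sum(sim[(i, j)] for i in Ia for j in Ib)
                new[(a, b)] = C * total / (len(Ia) * len(Ib))
        sim = new
    return sim

# B and C are both referenced by A, so they should come out similar.
in_nb = {"A": [], "B": ["A"], "C": ["A"]}
sim = simrank(in_nb)
```

Here `sim[("B", "C")]` converges to `C` (0.8) because B and C share their only in-neighbor, while pairs with no shared referencing structure stay at 0.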

Row selection (clustering):

How do you find what group of movies to recommend to users?

How do you avoid duplicate rows?

Search recommendations:

Play after search: transition play after query

Play related to search: Similarity between users (query, play)

LinkedIn: Metaphor, a system for related search recommendations

Q: Do these systems modify preferences over time?
A: We’ve looked into this problem and have tried various approaches to combat this tendency. “Random picks” is one; using multi-armed bandit models is another.

10:30 Quoc Le - Deep learning at Google

Background: Google engineer, Stanford PhD



Deep learning’s main advantage is that it requires less domain knowledge: there is no need to hand-engineer features, since they are learned automatically.

What is deep learning? Non-linearity is the key: it allows you to represent very complex functions that you can fit to your data.

At Google, we developed DistBelief to train deep neural networks (DNNs) across many machines.

The model is a multi-layered architecture:

Forward pass to compute features

Backward pass to compute gradients

How is computation distributed?

Model partitioning:
- every model’s computation is split across multiple machines
- a parameter server holds the shared weights
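A rough single-process sketch of the parameter-server pattern: workers pull a (possibly stale) snapshot of the weights, compute a gradient on their local data, and push it back without locking the full model. DistBelief itself is far more elaborate; the class and data below are illustrative.

```python
# Toy parameter-server / asynchronous-SGD sketch (not DistBelief's actual API).

class ParameterServer:
    """Holds the global weights; workers pull values and push gradients."""
    def __init__(self, dim):
        self.w = [0.0] * dim

    def pull(self):
        return list(self.w)  # worker gets a snapshot, which may be stale

    def push(self, grad, lr=0.1):
        # Asynchronous SGD: apply whichever gradient arrives next.
        for i, g in enumerate(grad):
            self.w[i] -= lr * g

def worker_step(ps, x, y):
    """One model replica: forward pass, backward pass, push gradient."""
    w = ps.pull()
    pred = sum(wi * xi for wi, xi in zip(w, x))   # forward pass
    err = pred - y
    grad = [err * xi for xi in x]                 # gradient of squared error
    ps.push(grad)

ps = ParameterServer(dim=2)
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
for _ in range(100):
    for x, y in data:   # in DistBelief these steps run on many machines in parallel
        worker_step(ps, x, y)
```

The sequential loop converges to the least-squares solution; in the real system, the staleness between `pull` and `push` is exactly the "delay" Le mentions in the Q&A below as the key metric for convergence.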

Applications of deep learning currently used at Google:

voice search

photo search

text understanding

Voice search:

Extract 1-second speech frames and send them through a NN to classify phonemes

Hidden layer with 1000s of nodes

Classification top layer

Photo search:

1.5 Bn parameters in NN

Example categories

Archery

Seat belt

Shredder

Boston rocker

Also works on cartoon images

Text understanding:

Start by understanding the “meaning” of a word: does “apple” mean the company or the fruit?

Every word is mapped into 100D space

read text

input is the previous N words

Relation extraction

Mikolov, Sutskever, Le: Learning the meaning behind words

Use it for machine translation
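The "meaning in a vector space" idea can be illustrated with a toy example: once words are mapped to vectors, word-sense questions become nearest-neighbor questions. The vectors below are made up for illustration (real learned embeddings are ~100-dimensional, as noted above).

```python
# Toy word-embedding similarity sketch; vectors are invented, not learned.
import math

emb = {
    "apple_fruit":   [0.9, 0.1, 0.0],
    "banana":        [0.8, 0.2, 0.1],
    "apple_company": [0.1, 0.9, 0.2],
    "google":        [0.0, 0.8, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Disambiguating "apple" amounts to asking which sense-vector
# the surrounding context is closer to.
fruit_like = cosine(emb["apple_fruit"], emb["banana"])
company_like = cosine(emb["apple_fruit"], emb["google"])
```

In a trained embedding space, the fruit sense sits near other foods and the company sense near other tech firms, which is what makes relations like translation pairs learnable as geometric structure.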

Q: Is your softmax layer in your parameter server?

A: No, it’s distributed with the model.

Q: For asynchronous SGD, what are the theoretical guarantees?

A: It turns out it works great. We embarked on this by first relaxing the need to have completely up-to-date weights in all distributed model replicas; after finding it worked so well, we went looking for papers that could give us theoretical bounds, and we found them. It turns out the key metric determining how well your models converge is the delay between the parameter server and the distributed replicas.

11:00 Joseph Gonzalez - understanding graph data

Background: Co-founder at GraphLab



A large portion of the data available on the web today is structured as graphs. Applications of graph processing:

determine the cohesiveness of communities by finding “triangles”

recommending products

identify leaders in communities using pagerank on follower structures

topic modeling

Map-reduce is not the best computation primitive for graphs; the computation model for graphs is different.

Graph-parallel abstraction:

User defines a program that lives on graph nodes

Program can only interact with neighboring edges and nodes

Programs must signal their neighbors to trigger updates

Time is represented as epochs, or iterations through the graph

Power-law degree distribution – a large number of low-degree vertices and a small number of high-degree vertices

High-degree vertices present challenges in scaling graph algorithms: they become single points of high communication overhead.

Solution proposed by GraphLab 2: split high-degree vertices across multiple machines, by splitting the vertex program into 3 phases:

Gather

Apply (update)

Scatter (trigger neighbors)
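The three phases above can be sketched with PageRank, the canonical vertex program. This is a sequential toy (a real engine like GraphLab runs the phases in parallel across machines and uses scatter to signal only vertices whose values changed); the graph is invented for illustration.

```python
# Minimal PageRank written in the Gather-Apply-Scatter style (sequential toy).

def pagerank_gas(out_edges, d=0.85, iters=50):
    nodes = list(out_edges)
    in_edges = {v: [u for u in nodes if v in out_edges[u]] for v in nodes}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            # Gather: sum contributions from in-neighbors.
            total = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
            # Apply: update the vertex value.
            new[v] = (1 - d) / len(nodes) + d * total
            # Scatter: a real engine would now signal out-neighbors to re-run.
        rank = new
    return rank

ranks = pagerank_gas({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

Because gather is a commutative-associative sum, GraphLab 2 can compute it partially on each machine holding a piece of a split high-degree vertex and combine the partial sums, which is what removes the communication bottleneck described above.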



2:00 Scott Triglia - Building a data product at Yelp

Background: Yelp Engineer

Goals of the talk:

Show our thought process when building recommendations

Expose some useful questions

Practical solutions for new teams

Yelp overview:

108 Million monthly visitors, 42 million reviews

half of searches come from mobile devices

Main topic:

Recommendation service for nearby places

Meant to replace the unpersonalized nearby search systems

Decision time:

Gathered a team: 3 backend engineers, 3 frontend engineers

First step: figure out what problem a recommendation system is trying to solve.

First question … who else is solving a similar problem? Netflix!

Second question … what’s different from our scenario vs Netflix? The context of location and time is very important.

Organizational context matters too:

very little infrastructure support for large-scale ML

must scale to all yelp data on day 1

team was small

this is a first version of a hopefully long lived product

Guiding principles:

Fast response time for pruning

Beating a benchmark was not the main goal, wanted to build a “good product”

System architecture:

API Request -> “Experts” nodes -> search

The “experts” metaphor allows you to modularize

Each expert ranks their own results
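A sketch of what that architecture might look like: each expert is an independent module that proposes and scores its own candidates, and a coordinator merges the ranked lists. The expert names and scoring rule here are invented for illustration, not Yelp's actual code.

```python
# Toy "experts" recommender: modular candidate sources merged by a coordinator.

class Expert:
    def suggest(self, user, location):
        """Return a list of (business, score) pairs; may be empty."""
        raise NotImplementedError

class PopularNearbyExpert(Expert):
    """Unpersonalized fallback: popular places near the given location."""
    def __init__(self, popular_by_location):
        self.popular = popular_by_location
    def suggest(self, user, location):
        return self.popular.get(location, [])

class LikedCategoryExpert(Expert):
    """Personalized: places matching categories this user has liked."""
    def __init__(self, liked_by_user):
        self.liked = liked_by_user
    def suggest(self, user, location):
        return self.liked.get(user, [])

def recommend(experts, user, location, k=3):
    merged = {}
    for e in experts:
        for business, score in e.suggest(user, location):
            merged[business] = max(merged.get(business, 0.0), score)
    return sorted(merged, key=merged.get, reverse=True)[:k]

experts = [
    PopularNearbyExpert({"sf": [("taqueria", 0.9), ("coffee_bar", 0.6)]}),
    LikedCategoryExpert({"alice": [("ramen_spot", 0.8)]}),
]
recs = recommend(experts, "alice", "sf")
```

Note how an expert with a single confident suggestion still contributes, which matches the observation below that experts don’t have to be prolific to be valuable, and why a recommendation is easy to explain ("which expert produced it?").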

Observation:

Experts don’t have to be very prolific to be valuable.

Tips:

Solve your own problem not someone else’s

Being cutting edge may not be the top priority

Build for the tools you have.

Get your end-to-end systems working first!

Good software engineering enables quality ML

Q: What was the release process like?

A: We did iterative internal releases to try to answer all the questions of how the system was working. Being able to answer “why” a recommendation was made was important.

Q: What were key metrics?

A: Nearby page view times

2:30 Jake Mannix - Content based recommender systems

Background: user modeling at Twitter; former LinkedIn data science team

Recommender systems:
- “Groups you may like”
- “Articles you may be interested in”
- “Ads you may like”

An aside: “users” don’t have to be people. LinkedIn wanted to build a generalized recommender framework, so we can view the following scenarios as [user, item to be recommended] pairs:
- TalentMatch [user=job posting, item=user profile]
- Jobs for your group [user, group]
- Jobs for your group [group, job]
- Ads you may like [user, ad]

Collaborative filtering is generic!
- Users/items are reduced to UUIDs; they could be anything

What about domain specific knowledge?
- Users are more than just account names

Features!

each user has a feature vector

each item has a feature vector
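Content-based scoring in its simplest form: put both the "user" and the "item" in the same feature space and score relevance with a dot product. The features and job postings below are made up to illustrate a TalentMatch-style [job posting, user profile] pairing.

```python
# Toy content-based scoring: user and item share one feature space.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Feature order (illustrative): [java, machine_learning, management]
user_profile = [0.9, 0.7, 0.0]   # e.g. skills extracted from a profile
job_postings = {
    "ml_engineer": [0.5, 0.9, 0.0],
    "eng_manager": [0.2, 0.0, 0.9],
}

best = max(job_postings, key=lambda j: dot(user_profile, job_postings[j]))
```

Unlike pure collaborative filtering over opaque UUIDs, this uses the domain-specific features directly, so it works even for brand-new items with no interaction history.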
