The meeting started with one of the organizers, Tammy Lee, welcoming us to Dice Cabana. Alan Besner then took the MC slot to introduce tonight's speakers after a word from OfferUp's Arean, who explained that as the largest local mobile marketplace in the US they have collected some high-volume data, and will be listening attentively to the talks. Unsurprisingly, OfferUp are hiring (their staff having grown about 400% in the last year).
Tammy announced that the group is keen to reach out to members and help them with their professional development. The group has a new logo designed by New Relic's Jen Pierce. It tweets as @ps_python, and they have a LinkedIn group "to make things less socially awkward." Tammy will be happy to talk to any members who want to get involved: pugetsoundpython at gmail dot com.
The first speaker was Carlos Guestrin, founding CEO of Dato (formerly GraphLab), who is also Professor of Machine Learning at UW. His talk was about recent developments in machine learning. Companies have been collecting data for a long time, but early attempts at analytics were not really addressing the real questions.
In talking to over 300 businesses with interests in machine learning, Carlos has again and again seen a failure to eliminate what he called "duct tape and spit" work. Building tailored Hadoop systems is expensive and time-consuming, so we need a way to scale Python processes to the standard big-data ecosystem.
Dato's mission is to empower people to handle all sorts of data sources at scale, even from a laptop. Users should never hit out-of-memory issues, and the GraphLab Create product allows building analytic tools that will move up to web scale with GraphLab Produce.
A first live demo was an image search, hosted on Amazon, matching with predictions made by a neural network running on Amazon infrastructure. An engineer who joined the team last October built this demo in a few days.
A second demo showed a recommender built from Amazon product data. It allowed two people to list their tastes and then found a way from a favored product of one user to a favored product of another.
Carlos is interested in allowing people to build their own applications, so he showed us an IPython Notebook that read in Amazon book review data. An interesting feature of GraphLab objects is that they have graphical representations in the Notebook, with the ability to drill down.
Carlos built and trained a book recommender with a single function call. This is all very well, but how do we scale such a service? The answer is to take the recommender model and deploy it as a service hosted on EC2 over a number of machines. The system is robust to the failure of single machines, and uses caching to take advantage of repeated queries to the RESTful API.
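In outline, the workflow looked something like the sketch below; the file and column names are my own invention, and the calls are GraphLab Create's recommender API as I understood it from the demo.

    import graphlab

    # Load the review data into an SFrame (the file name here is hypothetical)
    reviews = graphlab.SFrame.read_csv('amazon_book_reviews.csv')

    # Build and train a recommender with a single call
    model = graphlab.recommender.create(reviews,
                                        user_id='user_id',
                                        item_id='book_id',
                                        target='rating')

    # Ask for ten recommendations each for a couple of users
    recommendations = model.recommend(users=['alice', 'bob'], k=10)
    print(recommendations)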
The underlying support infrastructure is written in "high-performance C++". The GraphLab components are based on an open source core, with an SDK that allows you to extend them however you want. GraphLab Create is unique in its ability to allow a laptop to handle terabyte volumes. It includes predictive algorithms and scales well - Carlos showed some impressive statistics on desktop performance. TAPAD had a problem with 400M nodes and 600M edges; GraphLab handled it where Mahout and Spark had failed.
One of the key components is the SFrame, a scalable way to represent tabular data. These representations can be sharded, and can be run on a distributed cluster without change to the Python code. Another is the SGraph, which offers similarly scalable approaches to graph data.
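As a rough illustration (the file and column names below are invented), an SFrame supports familiar column-wise operations without requiring the data to fit in memory, and an SGraph can be built directly from an SFrame of edges:

    import graphlab

    # SFrame: filtering and derived columns, evaluated out of core
    products = graphlab.SFrame.read_csv('products.csv')
    cheap = products[products['price'] < 20.0]
    products['price_with_tax'] = products['price'] * 1.095

    # SGraph: build a graph from an edge list held in an SFrame
    links = graphlab.SFrame.read_csv('product_links.csv')
    graph = graphlab.SGraph().add_edges(links, src_field='src', dst_field='dst')
    print(graph.summary())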
The toolkits are based on an abstract pyramid whose base is the algorithms and SDK. Upon this is layered a machine learning layer.
Deployment is very important. How can you connect predictive services to your existing web front-end? GraphLab allows you to serve predictions in a robust, scalable way.
Carlos closed by pointing out the Data Science and Dato Conference, whose fourth incarnation is in July in San Francisco.
Q: What's the pricing model?
A: You can use GraphLab Create for free; you can use Produce and other services either on annual license or by the hour.
Q: How do you debug distributed machine learning algorithms?
A: It's hard. We've been working to make it easier, and Carlos pushes for what he calls "insane customer focus".
Q: Is SFrame an open source component?
A: Yes.
Q: What about training - how do you train your models?
A: We have made it somewhat easier by producing tools. The second layer, choosing models and parameters, is assisted by recent developments. The next release will support automatic feature search.
Q: Is the data store based on known technologies?
A: It isn't a data store, it's an ephemeral representation. Use storage appropriate to your tasks.
Q: Do you have a time series component?
A: Yes, we are working on it, but we want to talk about what's interesting to potential customers.
Q: Do you have use cases in personalized medicine?
A: We've seen some interesting applications; one example reduced the cost of registering new drugs.
Erin Shellman works at Nordstrom Data Lab, a small group of computer and data scientists. She built scrapers to extract data from sports retailers. Erin reminded us of the problems of actually getting your hands on the data you want. Volumes of data are effectively locked up in DOMs.
The motivation for the project was to determine the optimum point to reduce prices on specific products. It was therefore necessary to extract the competitive data, which Erin decided to do with Scrapy. Erin likes the technology because it's adaptable and, being based on Twisted, super-fast.
Using code examples (available from a Github repository) Erin showed us how to start a web scraping project. The first of the principal components is items.py, which tells Scrapy which data you want to scrape. To extract this kind of data you really need to look at the DOM, which can yield valuable information (one retailer was including SKU-level product availability information in hidden fields!)
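An items.py for a project like this might look roughly as follows; the field names are my own guesses, not Erin's.

    import scrapy

    class ProductItem(scrapy.Item):
        # One Field per piece of data to pull out of each product page
        brand = scrapy.Field()
        name = scrapy.Field()
        price = scrapy.Field()
        sku = scrapy.Field()
        availability = scrapy.Field()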
The second component is the web crawler. Erin decided to start at one competitor's "brands" page, then follow the brand's "shop all" link, which got her to a page full of products. The crawl setup subclasses the Scrapy CrawlSpider class. The page parser uses yield to generate successive links in the same page. Erin pointed out that Scrapy also has a useful interactive shell mode, which allows you to test the assumptions on which you build your crawler.
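In outline, using current Scrapy idioms, such a crawler looks something like this; the start URL and CSS selectors are invented for illustration.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BrandSpider(CrawlSpider):
        name = 'brands'
        start_urls = ['http://www.example-retailer.com/brands']  # hypothetical

        # Follow each brand's "shop all" link to reach its product listing
        rules = (
            Rule(LinkExtractor(restrict_css='a.shop-all'),
                 callback='parse_products'),
        )

        def parse_products(self, response):
            # Yield one item per product found on the listing page
            for product in response.css('div.product'):
                yield {
                    'brand': response.css('h1.brand::text').get(),
                    'name': product.css('a.title::text').get(),
                    'price': product.css('span.price::text').get(),
                }

The interactive shell Erin mentioned (scrapy shell <url>) is the natural place to try out selectors like these before committing them to the spider.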
Erin found that smaller brands didn't implement a "shop all" page, and for those cases she had to implement logic to scrape the data from the page where it would normally appear.
Erin's description of how to handle paginated multi-page listings showed that Scrapy automatically skips already-processed pages, so it is safe to issue redundant requests in which the same page may appear multiple times in the results.
Erin underlined the necessity of cleaning the data as early in the processing chain as possible, and showed some code she had used to eliminate unnecessary data. Interestingly, Erin ended up collecting a lot more data than she had initially set out to find. Disposition of the scraped data can be handled by a further component of Scrapy, which was not covered.
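That further component is Scrapy's item pipeline; a minimal cleaning pipeline, with a hypothetical price field, might read as follows and would be switched on via the ITEM_PIPELINES setting in settings.py.

    class CleanPricePipeline(object):
        """Normalize the price field as soon as items leave the spider."""

        def process_item(self, item, spider):
            raw = item.get('price')
            if raw:
                # Strip currency symbols and thousands separators, e.g. "$1,299.95"
                item['price'] = float(raw.replace('$', '').replace(',', '').strip())
            return item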
Most of data science is data wrangling. Scraper code at https://github.com/erinshellman/backcountry-scraper
Erin also pointed out that the next PyLadies Seattle is on Thursday, January 29th.
Q: If you repeat scrapes, is there support for discovering DOM structure changes?
A: Good question, I don't know. I assume there must be, but didn't feel I needed it.
Q: If you ran this job over the course of 10 minutes (with 40,000+ GET requests) do you need to use stealth techniques?
A: We haven't so far. Retailers aren't currently setting their web sites up to protect against such techniques.
Q: What's the worst web site you've come across to scrape?
A: I don't want to say; but you can tell a lot about the quality of the dev teams by scraping their pages.
Q: Does Scrapy have support to honor robots.txt?
A: No, but the crawler can implement this function easily.
Q: What techniques did you use to match information?
A: Pandas was very helpful in analysis.
Q: What about web sites with a lot of client-side scripting?
A: We didn't run into that problem, so didn't have to address the issues.
Trey Causey from Facebook rose after the interval to talk on Pythonic Data Science: from __future__ import machine_learning. Trey's side job is as a statistical consultant for an unnamed NFL team.
The steps of data science are
1. Get data
2. Build Model
3. Predict
Python is quickly becoming the preferred language of the data science world. NumPy has given rise to Pandas, whose DataFrame structure is based on it. Pandas can interface with scikit-learn.
A DataFrame offers fast ETL operations, descriptive statistics, and the ability to read in a wide variety of data sources. Trey showed a DataFrame populated with NFL data in which each row represented a play in a specific game, which had been processed to remove irrelevant events.
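A few lines of pandas cover that sort of ETL; the file and column names below are placeholders rather than Trey's actual schema.

    import pandas as pd

    # Load play-by-play data; pandas can also read SQL, JSON, Excel and more
    plays = pd.read_csv('nfl_plays.csv')

    # Drop irrelevant events so each row is an actual play
    plays = plays[~plays['play_type'].isin(['timeout', 'end_of_quarter'])]

    # Quick descriptive statistics, plus a per-team summary
    print(plays.describe())
    print(plays.groupby('offense_team')['yards_gained'].mean())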
Scikit-learn is a fast system for machine learning. New algorithms are being added with each release, and the consistent API they offer is a notable feature. It offers facilities for classification, regression, clustering, dimensionality reduction and preprocessing. This last is valuable, given how much data science work is just getting the data into the desired form.
Scikit-learn models implement the interface of the Estimator class: they share a .fit(X, y=None) method, transformers add .transform(), predictors add .predict(), and some also offer .predict_proba(). The "algorithm cheat sheet" is an attempt to show how to use the package's features to obtain desired results.
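The uniform interface means every estimator is driven the same way; here is a small example (the model choice and toy data are mine, purely to show the API).

    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Toy data standing in for play-by-play features
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)

    # A transformer: .fit() learns the scaling parameters, .transform() applies them
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)

    # A predictor: .fit() trains, .predict() classifies, .predict_proba() gives probabilities
    clf = LogisticRegression().fit(X_scaled, y)
    print(clf.predict(X_scaled[:5]))
    print(clf.predict_proba(X_scaled[:5]))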
Suppose you are interested in the probability that a given team wins a game given the appearance of a particular play in that game. This allows coaches to answer questions like "should we punt on this fourth down?". This is therefore a classification problem, under the assumption that plays are independent (which seems a ridiculous assumption, but in fact works pretty well).
The janitorial steps involve looking at the data. Does it suffer from class imbalances (e.g. many more wins than losses)? Does it need centering and scaling? How do I split my data set into training and prediction sets?
A first study showed a histogram of the number of plays against remaining time, with a major spike just before halftime and smaller spikes before the ends of the quarters due to the use of timeouts. The predictive technique used was random forests. Testing the predictions looked at both true and false negatives and positives. Calibration is important for predictive forecasts. Sometimes you will need to use domain knowledge to wring information out of the data.
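A condensed version of those janitorial and modelling steps might look like the sketch below, with synthetic data standing in for the real play-by-play features and a crude binned check in place of a proper calibration plot.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for play-level features and a win/loss label
    X, y = make_classification(n_samples=5000, n_features=12,
                               weights=[0.55, 0.45], random_state=0)
    print(pd.Series(y).value_counts())  # check for class imbalance

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=0)

    # Random forest win-probability model
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)
    win_prob = forest.predict_proba(X_test)[:, 1]

    # Crude calibration check: within each predicted-probability bin,
    # how often did the "win" outcome actually occur?
    bins = pd.cut(win_prob, np.linspace(0, 1, 11))
    print(pd.Series(y_test).groupby(bins).mean())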
Trey showed a number of graphics, showing various statistics which my total lack of football knowledge didn't allow me to interpret sensibly. He pointed out some of his important omissions, being quite frank about the shortcomings of his methodology. Feature engineering depends on subtleties of the data, and no confidence intervals are given. Putting models into production is difficult, and data serialization has not yet been suitably standardized.
Data science may not be easy, but it can be very rewarding. Read more from @treycausey (trey dot causey at gmail dot com) or at thespread.us.
Q: Does what you are doing become less effective over time?
A: In the NFL case most teams don't trust statistics, so it is difficult to get coaches to implement stats-based decision making. In the larger context, if you get really good at predicting a particular phenomenon, you would expect performance to decline over time due to changing behavior.
Q: Are there particular teams who are making visibly good or bad decisions?
A: Yes. It's easy to see which coaches aren't optimizing at all, but even the best teams still have a way to go. The New York Times has a "fourthdownbot" that is fun.
Q: Are you trying to develop mental constructs of what is right and wrong in feature engineering?
A: That's a philosophical divide in the data science communities; some prefer more parsimonious models, which should yield actionable features, others prefer highly parameterized models with better predictive ability.
Q: Do you understand more about the features in your data as you work with it?
A: Yes, all the time, and the new features aren't always obvious at first. This is basically a knowledge representation problem. Sometimes there is just no signal ...
Q: How are you giving the coaches this data?
A: I can't tell you how. Cellphone communications are forbidden on the sidelines.
Q: Are you REALLY good at fantasy football?
A: Fantasy football's scoring system is highly variable; a lot of research shows that some scoring features can't be predicted. Usually FF's busy time is Trey's busy time too, so he tends to play like other people do and say "he's looking good this week."
Thanks to all speakers for their amazing talks, and thanks to the group for hosting me.