Apache Spark has become a common tool in the data scientist’s toolbox, and in this post we show how to use the recently released Spark 2.1 to analyze data from the National Basketball Association (NBA). All code and examples from this blog post are available on GitHub.
Analytics have become a major tool in the sports world, and in the NBA in particular analytics have shaped how the sport is played. The league has skewed towards taking more 3-point shots due to their high efficiency as measured by points per field goal attempt. In this post we evaluate and analyze this trend in the NBA using season statistics data going back to 1979 along with geospatial shot chart data. The concepts in this post -- data cleansing, visualization, and modeling in Spark -- are general data science concepts and are applicable for other tasks beyond analyzing sports data. The post concludes with the author’s general impressions about using Spark and with tips and suggestions for new users.
For the analyses, we use Python 3 with the Spark Python API (PySpark) to create and analyze Spark DataFrames. We use both the Spark DataFrame domain-specific language (DSL) and Spark SQL to cleanse and visualize the season data, and finally build a simple linear regression model using the spark.ml package, which is now Spark’s primary machine learning API.
Finally, we note that the analysis in this tutorial can be run with a distributed Spark setup running on a cloud service such as Amazon Web Services (AWS) or on a Spark instance running on a local machine. We have tested both and have included resources for getting started on either AWS or a local machine at the end of this post.
The Code
Using data from Basketball Reference, we read the season total stats for every player since the 1979-80 season into a Spark DataFrame using PySpark. DataFrames are designed to ease the processing of large amounts of structured tabular data on the Spark infrastructure. In the Scala and Java APIs a DataFrame is now simply a Dataset of Row objects (in Scala, a type alias for Dataset[Row]).
We can also view the column names of our DataFrame:
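A minimal sketch of both steps, assuming an active SparkSession named `spark` and a CSV file with a header row. The helper name and the file path are hypothetical, not taken from the repository:

```python
def load_season_totals(spark, csv_path):
    """Read a CSV with a header row into a Spark DataFrame, inferring column types."""
    return spark.read.csv(csv_path, header=True, inferSchema=True)


# Usage inside a PySpark session (the path below is a placeholder):
#   df = load_season_totals(spark, "data/season_totals.csv")
#   print(df.columns)   # Python list of column names
#   df.printSchema()    # column names together with the inferred types
```

Setting `inferSchema=True` costs an extra pass over the file, but it means numeric columns such as `pts` and `mp` arrive as numbers rather than strings.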
Using our DataFrame, we can view the top 10 players, sorted by number of points in an individual season. Notice we use the toPandas method to retrieve our results, because the result displays more cleanly than the output of the take method.
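One way to express this query with the DataFrame DSL is sketched below. The column names come from the result table; the function name is our own, and `toPandas` requires pandas on the driver:

```python
def top_scoring_seasons(df, n=10):
    """Return the n highest-scoring player seasons as a pandas DataFrame."""
    return (
        df.select("yr", "player", "age", "pts", "fg3")
          .orderBy("pts", ascending=False)  # highest point totals first
          .limit(n)                         # keep the result small before collecting
          .toPandas()                       # collect to the driver for display
    )


# Usage: top_scoring_seasons(df)
```

Applying `limit` before `toPandas` matters: `toPandas` pulls every remaining row to the driver, so it should only be called on small results.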
yr    player          age  pts   fg3
1987  Jordan,Michael  23   3041  12
1988  Jordan,Michael  24   2868  7
2006  Bryant,Kobe     27   2832  180
1990  Jordan,Michael  26   2753  92
1989  Jordan,Michael  25   2633  27
2014  Durant,Kevin    25   2593  192
1980  Gervin,George   27   2585  32
1991  Jordan,Michael  27   2580  29
1982  Gervin,George   29   2551  10
1993  Jordan,Michael  29   2541  81
Next, using the DataFrame domain-specific language (DSL), we can analyze the average number of 3-point attempts for each season, scaled to the industry-standard per-36-minutes rate (fg3a_p36m). The per-36-minutes metric projects a given player’s stats to 36 minutes, roughly a full NBA game with adequate rest, which allows comparison across players who play different numbers of minutes.
We compute this metric using the number of 3-point field goal attempts (fg3a) and minutes played (mp).
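As a sketch of the arithmetic and one possible DSL aggregation: a player with 150 attempts in 1,800 minutes averages 36 × 150 / 1800 = 3.0 attempts per 36 minutes. The season-level version below divides the season's total attempts by its total minutes, which is one reasonable reading of "average" here; the function names are our own:

```python
def per_36(total, minutes):
    """Scale a counting stat to a per-36-minute rate."""
    return 36 * total / minutes


def fg3a_per_36_by_season(df):
    """Season-level 3-point attempts per 36 minutes, using the DataFrame DSL."""
    # Sum attempts and minutes within each season; Spark names the
    # resulting columns "sum(fg3a)" and "sum(mp)".
    totals = df.groupBy("yr").agg({"fg3a": "sum", "mp": "sum"})
    rate = 36 * totals["sum(fg3a)"] / totals["sum(mp)"]
    return totals.select("yr", rate.alias("fg3a_p36m")).orderBy("yr")


# Usage: fg3a_by_year = fg3a_per_36_by_season(df)
```

Dividing the sums weights each player by minutes played. Averaging each player's individual rate would instead give low-minute players the same weight as starters.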
Alternatively, we can utilize Spark SQL to perform the same query using SQL syntax:
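A sketch of the SQL route, assuming the same `fg3a`, `mp`, and `yr` columns. The view name `season_totals` and the function name are hypothetical:

```python
FG3A_PER_36_SQL = """
SELECT yr, 36 * SUM(fg3a) / SUM(mp) AS fg3a_p36m
FROM season_totals
GROUP BY yr
ORDER BY yr
"""


def fg3a_per_36_sql(spark, df):
    """Run the per-36 aggregation through Spark SQL instead of the DSL."""
    # Expose the DataFrame to the SQL engine under a temporary view name.
    df.createOrReplaceTempView("season_totals")
    return spark.sql(FG3A_PER_36_SQL)


# Usage: fg3a_by_year = fg3a_per_36_sql(spark, df)
```

Both routes produce a Spark DataFrame, so the choice between DSL and SQL is largely one of taste and of which is easier to read for a given query.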
Now that we have aggregated our data and computed the average attempts per 36 minutes for each season, we can query our results into a Pandas DataFrame and plot it using matplotlib.
We can see a steady rise in the number of 3-point attempts since the shot's introduction in the 1979-80 season, along with a temporary spike in attempts during the mid-1990s, when the NBA moved the line in a few feet.
We can fit a linear regression model to this curve to project the number of shot attempts for the next 5 years. Of course, this assumes that attempts increase at a linear rate, which is likely a naive assumption.
First, we must use the VectorAssembler transformer to convert our data into a single column where each row of the DataFrame contains a feature vector. This is a requirement for the linear regression API in MLlib. We build the transformer using our single variable `yr` and transform our season total data using its transform method.
We then build our linear regression model object using our transformed data.
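The two steps could look roughly like the sketch below. The label source column name (`fg3a_p36m`) and the function names are assumptions, and default `LinearRegression` settings are used:

```python
FEATURE_COLUMNS = ["yr"]      # the single predictor named in the text
LABEL_SOURCE = "fg3a_p36m"    # assumed name of the per-36 attempts column


def assemble_training_data(season_df, label_source=LABEL_SOURCE):
    """Add a `features` vector column and a `label` column to the season data."""
    # Imported inside the function so the sketch loads without Spark installed.
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    assembled = assembler.transform(season_df)
    training = assembled.withColumn("label", assembled[label_source])
    return assembler, training


def fit_trend(training):
    """Fit an ordinary least-squares line to the assembled training data."""
    from pyspark.ml.regression import LinearRegression

    lr = LinearRegression(featuresCol="features", labelCol="label")
    return lr.fit(training)


# Usage:
#   assembler, training = assemble_training_data(fg3a_by_year)
#   model = fit_trend(training)
#   print(model.coefficients, model.intercept)
```

Returning the assembler alongside the training data lets the same transformer be reused later on any new rows that need predictions.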
yr    fga_pm       fg3a_pm       features  label
1980  13.49321407  0.410089262   [1980.0]  0.410089262
1981  13.15346947  0.3093759891  [1981.0]  0.3093759891
1982  13.20229631  0.3415114296  [1982.0]  0.3415114296
1983  13.30541336  0.3314785517  [1983.0]  0.3314785517
1984  13.14301635  0.3571099981  [1984.0]  0.3571099981
Next, we apply our trained model (the model object) to our original training set along with 5 years of future data. We build a new DataFrame covering this time period, transform it to include a feature vector, and then apply our model to make predictions.
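A sketch of this step, assuming a fitted model and a feature assembler (any objects with a Spark-style `transform` method) are already in hand. The function names and the year range in the usage comment are illustrative:

```python
def prediction_years(first_year, last_year, n_future=5):
    """Years covering the training period plus n_future seasons beyond it."""
    return list(range(first_year, last_year + n_future + 1))


def predict_trend(spark, assembler, model, first_year, last_year, n_future=5):
    """Build a one-column DataFrame of years and score it with the fitted model."""
    years = prediction_years(first_year, last_year, n_future)
    # One row per year, in a single column named "yr" to match the training data.
    frame = spark.createDataFrame([(year,) for year in years], ["yr"])
    # The assembler adds the feature vector; the model appends its predictions.
    return model.transform(assembler.transform(frame))


# Usage (illustrative range):
#   predictions = predict_trend(spark, assembler, model, 1980, 2016)
#   predictions.select("yr", "prediction").toPandas()
```

With the default settings, a fitted spark.ml regression model writes its output to a column named `prediction`.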
We can then plot our results:
Analyzing Geospatial Shot Chart Data
In addition to season total data, we process and analyze NBA shot charts to view the impact the 3-point revolution has had on shot selection. The shot chart data was originally sourced from NBA.com.
The shot chart data contains xy coordinates of field goal attempts on the court for individual players, game date, time of shot, shot distance, a shot made flag, and other fields. We have compiled all individual seasons in which a player attempted at least 1000 field goals, from the 2010-11 through the 2015-16 season.
As before, we read the CSV data into a Spark DataFrame.
We preview the data.
yr    name               game_date   shot_distance  x    y    shot_made_flag
2011  LaMarcus Aldridge  2010-10-26  1              4    11   0
2011  Paul Pierce        2010-10-26  25             67   246  1
2011  Paul Pierce        2010-10-26  18             165  83   0
2011  Paul Pierce        2010-10-26  24             159  186  0
2011  Paul Pierce        2010-10-26  24             198  148  1
2011  Paul Pierce        2010-10-26  23             231  4    1
2011  Paul Pierce        2010-10-26  1              -7   9    0
2011  Paul Pierce        2010-10-26  0              -2   -5   1
2011  LaMarcus Aldridge  2010-10-26  21             39   211  0
2011  LaMarcus Aldridge  2010-10-26  8              -82  23   0
We can query an individual player and season and visualize their shot locations. We built a plotting function plot_shot_chart (see the GitHub repo) that is based on Savvas Tjortjoglou's example.
As an example, we query and visualize Steph Curry's 2015-16 historic shooting season using a hexbin plot, which is a two-dimensional histogram with hexagonal-shaped bins.
The shot chart data is rich in information, but it does not specify whether a shot is a 3-point attempt or, more specifically, a corner 3. We solve this by building user-defined functions (UDFs) that identify the shot type from the xy coordinates of the shot attempt.
Here we define our shot labeling functions as standard Python functions that use numpy routines.
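A minimal sketch of such labeling functions follows. It uses the standard-library `math` module rather than numpy, and it assumes coordinates are in tenths of a foot measured from the basket, which is consistent with the preview rows (for example, x = 67, y = 246 corresponds to a 25-foot shot). The NBA line is 22 feet from the basket in the corners and 23.75 feet along the arc; the function names and exact cutoffs here are our own choices:

```python
import math

ARC_RADIUS = 237.5   # 23.75 ft arc, in tenths of a foot
CORNER_X = 220.0     # 22 ft corner line, in tenths of a foot
# Height at which the straight corner line meets the arc (about 89.5).
CORNER_Y_MAX = math.sqrt(ARC_RADIUS ** 2 - CORNER_X ** 2)


def is_corner_three(x, y):
    """True for shots from beyond the straight corner section of the line."""
    return abs(x) >= CORNER_X and y <= CORNER_Y_MAX


def is_above_break_three(x, y):
    """True for shots from beyond the arc, above the corner section."""
    return y > CORNER_Y_MAX and math.hypot(x, y) >= ARC_RADIUS


def shot_type(x, y):
    """Label a shot location as a corner 3, an above-the-break 3, or a 2-pointer."""
    if is_corner_three(x, y):
        return "corner 3"
    if is_above_break_three(x, y):
        return "above-the-break 3"
    return "2-pointer"


# Rows from the data preview:
#   shot_type(231, 4)    -> "corner 3"
#   shot_type(67, 246)   -> "above-the-break 3"
#   shot_type(165, 83)   -> "2-pointer"
```

Recorded shot coordinates are approximate, so attempts very close to the line may be labeled differently than they were scored in the game.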
We then register our UDFs and apply each UDF to the entire dataset to classify each shot type:
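One way to wrap a plain Python predicate as a Spark UDF and apply it to every row is sketched below, using `pyspark.sql.functions.udf`. The helper name and the output column names are hypothetical, and the predicate is any function of `(x, y)` that returns a boolean:

```python
def add_shot_flag(df, column_name, predicate):
    """Append a 0/1 integer column computed from each row's x and y coordinates."""
    # Imported inside the function so the sketch loads without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Wrap the predicate so Spark receives an integer rather than a Python bool.
    flag_udf = udf(lambda x, y: int(predicate(x, y)), IntegerType())
    return df.withColumn(column_name, flag_udf(df["x"], df["y"]))


# Usage, with a simple corner-3 predicate (coordinates in tenths of a foot):
#   shots = add_shot_flag(shots, "corner_3",
#                         lambda x, y: abs(x) >= 220 and y <= 89.5)
```

Python UDFs are evaluated row by row in Python worker processes, so they are generally slower than Spark's built-in column expressions. For a dataset of this size the convenience is usually worth it.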
We can visualize the change in shot selection over the past 6 years using all of our data from the 2010-11 season through the 2015-16 season. For visualization purposes, we exclude all shot attempts taken inside of 8 feet, as we would like to focus on midrange and 3-point shots.
Over the years, there is a notable trend towards more three-pointers and fewer midrange shots.
Finally, we evaluate shot efficiency as a function of shot distance.
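Efficiency here means points per field goal attempt: a 36% 3-point shooter averages 0.36 × 3 = 1.08 points per attempt, the same as a 54% 2-point shooter (0.54 × 2 = 1.08). A sketch of the aggregation follows, assuming a 0/1 integer column marking 3-point attempts (called `is_three` here, a hypothetical name) has already been added by the labeling step:

```python
def efficiency_by_distance(shots_df, three_flag="is_three"):
    """Average points per attempt at each shot distance."""
    # A make is worth 2 points, or 3 when the 0/1 flag is set; a miss is worth 0.
    points = shots_df["shot_made_flag"] * (2 + shots_df[three_flag])
    return (
        shots_df.withColumn("points", points)
                .groupBy("shot_distance")
                .agg({"points": "avg"})  # Spark names this column "avg(points)"
                .withColumnRenamed("avg(points)", "points_per_shot")
                .orderBy("shot_distance")
    )


# Usage: efficiency = efficiency_by_distance(shots).toPandas()
```

Distances with very few attempts produce noisy averages, so it can help to also count attempts per distance and filter out sparse bins before plotting.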
We then plot our results.
For the league's top scorers, close 3-point attempts are among the most efficient shots, on par with shots taken close to the basket. It's no wonder that accurate 3-point shooting is among the most coveted talents in the NBA today!
Conclusion
Lastly, as a seasoned data scientist, SQL user, and Python junkie, I offer my two cents on getting started with Spark. The Spark ecosystem and documentation are continually evolving, and it is important to use the newest Spark version. A first-time user will notice there are multiple ways to solve a problem using different languages (Scala, Java, Python, R), different APIs (Resilient Distributed Dataset (RDD), Dataset, DataFrame), and different data manipulation routines (DataFrame DSL, Spark SQL). Many choices are up to the user, and others are guided by the documentation. Since Spark 2.0, for example, the DataFrame has been the primary Spark API for Python and R users (rather than the original and still useful RDD). In addition, the DataFrame-based spark.ml package is now the primary machine learning API in Spark, replacing the RDD-based API. Bottom line: the platform is evolving and it pays to stay up to date.
In this post, we’ve demonstrated how to use Apache Spark to accomplish key data science tasks including data exploration, visualization, and model building. These principles are applicable to other data science tasks and datasets, and we encourage you to check out the repository and try it on your own!
Additional Resources
Install Spark on a local machine (Mac OS X Yosemite)
AWS and PySpark with Anaconda - Quick Start
Install Spark on AWS