How to pass successfully in the NHL

Introduction

Sportlogiq has provided an interesting data set to the data community in Montreal. They use computer vision technology to analyze sports by tracking players and categorizing their movements. The output of their algorithms breaks down a hockey game into a sequence of events: faceoffs, passes, dekes, body checks, shots on net, goals, etc. The data that they provided represents the even-strength (non-power-play) events of the 6 games of the Ottawa vs. Montreal playoff series in 2015. Please contact them directly for more information about the data set.

One thing that is particularly interesting about the data is that it tracks the passing in the game. There are many statistics collected about hockey that are used to analyze games, teams, and players, but passing statistics beyond assists are typically not amongst them. Tracking passing is labour-intensive for humans because it happens so frequently within each game. However, Sportlogiq’s computer vision technology makes pass tracking a computer-automated task. So, this technology can potentially open up passing as a dimension of hockey statistics. I am not aware of any other hockey statistics organization that can provide detailed passing data (please let me know if you know of one).

In this article, I would like to do some exploratory analysis of the passing data from this series of games to see what insights we can gain. The data tracks both successful and failed pass attempts. Therefore, we can hope to gain some insights about what ingredients make for a successful pass in the NHL. This information may be of interest to players, coaches, and fans.

The R scripts used for this analysis are available on github. The data is only available through Sportlogiq.

The Data

The data represents the even-strength (non-power-play) events of the 6 games of the Ottawa vs. Montreal playoff series in 2015. Each game is broken down into a sequence of events, including pass attempts. For each event in the game, the following features are supplied by Sportlogiq:

  • Event ID
  • Period (1st, 2nd, 3rd, or 4th for overtime)
  • Enumerated possession numbers
  • Which team possesses the puck
  • Enumerated play within the current possession
  • Whether the event breaks the current possession
  • Which frame of video the event happens at (can be used for timing information)
  • What type of event it was (about 100 event types are tracked, representing faceoffs, passes, carries, dekes, shots on net, body checks, and others). Passing events are also broken down into types (see below).
  • Which zone the event occurred in (defensive zone, neutral zone, offensive zone)
  • Whether the event was successful or failed
  • The coordinates of the player when the event occurred
  • The player responsible for the event, including first name, last name, jersey number, team, and position

The following plot shows the passes broken down by type. Descriptions of each type are:

  • d2d – Pass from one defenseman to another
  • eastwest – Cross-ice pass to a teammate in the offensive zone
  • north – Pass to a teammate around the boards in the offensive zone
  • outlet – Forward pass in the defensive zone
  • rush – Pass immediately following controlled entry into the offensive zone
  • slot – Pass to a teammate positioned in the slot
  • south – Pass in the offensive zone from one side of the ice to the other
  • stretch – Pass from the defensive zone to a teammate positioned on the other side of centre ice

Successful and failed passing broken down by pass type and position.

The 8 rink images show the origin locations for successful (green) and failed (red) passes for each pass type. In these plots, one can see clusters of red or green points indicating regions where the pass is more or less likely to be successful. For example, passing into the slot from behind the net is not likely to be successful, but passing into the slot from the wings has a relatively good chance of success.

The mosaic plot at the bottom shows the frequency of each pass type as the width along with the proportion of successful and failed passes of that type. Each type of pass shows a different success rate. This plot tells a story of risk and reward in transferring the puck to a teammate. “D2D” passes where one defenseman passes the puck to another are very low risk and are often unchallenged by the defending team. However, these passes are also relatively low reward because the defensemen are not typically in good scoring position, and often these passes happen in the defensive zone. In contrast, passes into the slot (the area immediately in front of the defending team’s net) are very high risk and high reward. A successful pass into the slot will place the puck in excellent scoring position. Therefore, these types of passes are fiercely defended against and are often not successful.

Feature Engineering

In addition to the above, we can compute some additional features of the passing events that can be useful for understanding passing in hockey (a rough sketch of these computations follows the list).

  • We can approximate the destination position of the pass as the coordinates of the next event in the sequence. In general, the true destination position of the pass isn’t recorded, so the location of the next event in the game is a heuristic, but in most cases the next event will be near where the puck ended up as a result of the pass. It is the best that we can do with the given data.
  • From the above approximation, we can approximately compute:
    • Distance of the pass
    • Speed of the pass, using the frame information to determine the time between the pass and the next event
    • Distance to the net of the pass destination position
  • We can approximate the amount of time the passer controlled the puck by looking at the amount of time between the previous event and the pass.
  • We can record which game the pass occurred in.
  • We can compute the distance to the net from the pass’s origin position.
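
To make these computations concrete, here is a rough R sketch of the engineered features. The column names (game, frame, xPos, yPos), the frame rate, and the net coordinates are assumptions for illustration; the real Sportlogiq field names may differ, and this is not the exact script from the repository.

    # Rough sketch of the engineered passing features (hypothetical column names).
    library(dplyr)

    frames_per_second <- 30    # assumed video frame rate
    x_net <- c(-89, 89)        # assumed net locations, in feet from centre ice

    events <- events %>%
      arrange(game, frame) %>%
      group_by(game) %>%
      mutate(
        # Approximate the pass destination by the location of the next event
        xNext = lead(xPos),
        yNext = lead(yPos),
        # Distance and speed of the pass
        passDist  = sqrt((xNext - xPos)^2 + (yNext - yPos)^2),
        passSpeed = passDist / ((lead(frame) - frame) / frames_per_second),
        # How long the passer controlled the puck before the pass
        playDuration = (frame - lag(frame)) / frames_per_second,
        # Distance from the pass origin and destination to the nearest net
        netdist   = pmin(sqrt((xPos - x_net[1])^2 + yPos^2),
                         sqrt((xPos - x_net[2])^2 + yPos^2)),
        d2netNext = pmin(sqrt((xNext - x_net[1])^2 + yNext^2),
                         sqrt((xNext - x_net[2])^2 + yNext^2))
      ) %>%
      ungroup()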

It is important to note that events occurring during power plays have been removed from the data set in a way that is not transparent. Therefore, I cannot always tell whether two neighbouring events are separated by a series of power play events or whether they are true neighbours. In some cases, the timing information provided by the video frame numbers can be used to identify this problem, but in other cases it goes undetected.

Insights on Successful Passing

Who are the best passers?

We expect that there may be some all-star passers who always complete a pass and some mediocre passers with a less impressive success rate. The following plot shows the passing success rate by player (successful passes / pass attempts). The error bars are one standard deviation, assuming that passing is a Poisson process, which turns out not to be quite true. So, a more thoughtful analysis may be needed for some purposes, but this approximation suffices for our exploratory analysis. The green region shows the series success rate for all players. The more successful passers on the right tend to be defensemen and goalies, while the less successful passers on the left tend to be forwards. This is explained by the different pass types that players in each position will attempt, as described above. From this plot, we can see that we don’t have enough statistics from a single series to identify the best passers, particularly taking into account their positions.

Passing success rates by player. The green band represents the series passing success rate for all players.
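
For illustration, a plot like the one above can be produced with an R sketch along these lines, assuming a data frame passes with hypothetical columns player and successful (TRUE/FALSE); it is not the exact script behind the figure.

    # Per-player success rates with rough one-standard-deviation Poisson error bars.
    library(dplyr)
    library(ggplot2)

    rates <- passes %>%
      group_by(player) %>%
      summarise(attempts = n(), successes = sum(successful)) %>%
      mutate(rate = successes / attempts,
             se   = sqrt(successes) / attempts)  # treat the success count as Poisson

    ggplot(rates, aes(x = reorder(player, rate), y = rate)) +
      geom_pointrange(aes(ymin = rate - se, ymax = rate + se)) +
      geom_hline(yintercept = sum(rates$successes) / sum(rates$attempts),
                 colour = "darkgreen") +     # series-wide success rate
      coord_flip() +
      labs(x = NULL, y = "Passing success rate")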

Similarly, there is no significant difference in passing success between the two teams.

Predicting successful passes

From the data, we can learn the ingredients of successful passes and how different features contribute. For this purpose, we will create a predictive classification model using a random forest with 1000 trees and 4 variables randomly sampled at each split. We will only use information that would be available to the passer at the time of making the pass. For example, our measures of pass distance and pass speed use information from future events, but they are within the control of the player making the pass. In contrast, whether the pass attempt is possession breaking is information from the future that the player does not have at the time of deciding to make the pass.
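
A minimal sketch of such a model in R, assuming the passes data frame has a factor column outcome (“successful”/“failed”) and the predictor columns named below; this is an illustration under those assumptions, not the author’s exact script.

    # Random forest classifier: 1000 trees, 4 variables tried at each split.
    library(randomForest)

    set.seed(1)
    rf <- randomForest(
      outcome ~ type + d2netNext + passDist + netdist + xPos + yPos +
                passSpeed + zone + playDuration,
      data  = passes,
      ntree = 1000,
      mtry  = 4
    )

    print(rf)         # out-of-bag error estimate and confusion matrix
    varImpPlot(rf)    # relative importance of each predictor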

The random forest model can show us the relative importance of the various features of our data set in terms of how strong they are at predicting the outcome. The following plot shows that pass type (whether the pass was defenseman to defenseman or into the slot) is the most important feature of the original data set. The variable d2netNext represents the distance to the net of the destination of the pass. This is the most important of the engineered features.

Variable importance measures from random forest.

The variable descriptions are as follows (in many cases approximations and heuristics are used where the data is not available):

  • type – The type of pass, as discussed in the “The Data” section
  • d2netNext – The distance between the destination of the pass and the nearest net
  • passDist – The distance between the origin and destination of the pass
  • netdist – The distance between the origin of the pass and the nearest net
  • xPos, yPos – The coordinates of the origin of the pass
  • passSpeed – The speed of the pass (distance divided by time)
  • zone – Whether the pass was in the defensive zone, neutral zone or offensive zone
  • playDuration – How long the player controlled the puck prior to passing

Other variables mentioned above but not appearing in the plot were found to have low predictive value and were not incorporated into the predictive model.

It is useful to examine some of these variables in more detail. Consider the ‘best’ engineered variable, the distance between the pass destination and the net:

Histogram for the distance between the pass destination and the net for successful (green) and failed (red) passes.

The above plot shows that passing the puck to a position close to the net is more difficult than passing to a position far away from the net. We can understand this in terms of risk and reward: positions close to the net offer the best scoring chances, so they are the most intensely defended.

The following plot represents pass distance:

Histogram of passing distance for successful (green) and failed (red) passes.

This plot seems to show a ‘sweet spot’ for passing distance. Passing between 10 and 40 percent of the rink length seems to produce a better chance of success. Shorter and longer passes may represent more desperate situations where the passer has less control. A long hail-mary pass may have less chance of success, and a very short pass may represent a player who is swarmed by the defence and has few options for moving the puck.

The following plot represents the pass speed:

Histogram representing pass speed.

This plot may show a slight advantage to faster passing, but the effect is too small to be significant. The spikes at very slow passes are likely due to power plays which have been removed from the data and can create an apparently long period separating a pass from the next event. This would be interpreted as a very slow pass for the purposes of this plot.

Overall, the predictive model can correctly classify 77.4% of the passes as successful or failed (classification accuracy). This is a somewhat disappointing result considering that predicting that every pass will be successful gets one to 71%. One important problem is that this analysis has been blind to the locations of all players other than the passer. This information is certainly a primary consideration for a player who is deciding whether to pass. To do better, we should incorporate information about the positions of the defenders, who are likely the cause of most failed passes. It is likely that Sportlogiq’s technology can track the defending players’ positions, but that data is not included in the data set that was made available.

For players deciding whether to attempt a pass or for coaches training players how to pass successfully, knowing the most likely outcome of the pass is not as useful as being able to weigh the risks against the rewards in each particular situation.  In the figure below, the lower left corner represents a player who assumes that all passes will fail (indicating that the player will never pass the puck). The top right corner represents a player that assumes every pass attempt will succeed (indicating that the player will take every possible opportunity to pass to a teammate who is in better scoring position). The curve connecting these points shows different levels of risk that a player could adopt within the predictive model and what fraction of successful and failed passes they will attempt at that level of risk.

Rate of true success predictions vs. rate of false success predictions (i.e. the true positive rate vs. the false positive rate for this classification problem).
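
For completeness, here is a sketch of how the accuracy figures and the curve above can be produced from the random forest sketched earlier, using the pROC package; the object and column names are the hypothetical ones from the previous sketches.

    # Baseline: predict that every pass succeeds (about 71% according to the text).
    baseline <- mean(passes$outcome == "successful")

    # Model accuracy from the forest's out-of-bag predictions.
    accuracy <- mean(predict(rf) == passes$outcome)

    # The risk/reward curve is an ROC curve: true positive rate vs. false positive
    # rate as the decision threshold (the player's appetite for risk) is varied.
    library(pROC)
    prob_success <- predict(rf, type = "prob")[, "successful"]
    roc_curve    <- roc(response = passes$outcome, predictor = prob_success)
    plot(roc_curve, legacy.axes = TRUE)   # x-axis shown as the false positive rate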

Conclusions

This has been a preliminary analysis of a relatively small data set representing only 6 games between the same two teams (Montreal and Ottawa). Only even strength play time is included. It is unclear how well the results will generalize to other games or to NHL hockey as a whole. However, the main insights that we can get from this analysis are:

  • Passing ability does not vary enough from player-to-player or team-to-team to see an effect in this data. The passing situation is much more important than the individuals involved. With data from more games, it may be possible to determine who are the most skilled passers and receivers on a team or in the league.
  • Passing the puck to a teammate who is near the net or in the slot is very difficult, but passing between defensemen in the defensive zone is relatively easy. This makes sense because the area in front of the net is where a player has the best scoring opportunity, so it is heavily defended.
  • Passing into the slot from the wings appears to have a better success rate than passing into the slot from behind the net.
  • There appears to be a sweet spot for passing distance between 10% and 40% of the length of the rink. Longer and shorter passes are more likely to fail.
  • Contrary to popular wisdom that says passes should be hard, we do not see a strong benefit to high speed passing in our data.

Moving Forward

The most important factor toward understanding passing statistics is to include the important context provided by the positions of all players on the ice at the time the pass is made. My analysis has been totally blind to this. The present data set provides glimpses of where individual players are when they are the focus of one of the tracked events. This could be used in principle to model and estimate where players are between the figurative radar blips. It would not be straightforward to model speeds and trajectories from these blips given the fast pace of hockey. However, it may be possible to do this with the current data set. Ultimately, it would be helpful to know the positions of each player for each frame of video. This would provide useful information for understanding the eligible pass receivers on the offensive team and the potential pass interferers on the defensive team. I am not sure if the current machine vision technology is capable of providing this information, but I think it is likely to exist in the future.

Currently, we are limited to data from just six games played within a short time frame between two teams. There are a number of things that may emerge from having more data from more games and more teams. We could determine who are the best passers and how much variation there is in passing and pass receiving skill within the NHL. We could compare passing rates and passing success rates between different teams and see how this leads to better plays and more goals. We could see how teams and individual players improve at passing with time and experience. We could also see how well our conclusions listed above generalize to other games and other teams.

I welcome feedback. Please let me know if I have made any mistakes or missed anything.

Sentiment analysis gone awry

Prooffreader has recently published an ‘attitude analysis’ of tweets related to feminism. One of his claims is that sentiment, as measured by traditional sentiment analysis techniques, is a poor predictor of attitude. To make this claim, he had a team (of humans) hand-classify 1000 tweets into one of three categories: pro-feminist, anti-feminist, or neither. The full analysis is available on Github, and it includes this plot:

Sentiment analysis cannot separate attitudes

This plot shows that there is basically no relationship between the sentiment of a tweet and whether the tweet is pro- or anti-feminist. This is an important lesson for organizations that use sentiment analysis to gauge the public’s attitude about their brand. What’s interesting is to look at the specific tweets that have positive sentiment but anti-feminist attitude and those that have negative sentiment but pro-feminist attitude. Prooffreader gives these hypothetical examples of the latter case:

    1. Man, do I ever hate feminists.
    2. I hate that my mom does not like the word ‘feminism’.

The data set contains cases showing a real-world breakdown in the usefulness of basic sentiment analysis when the real goal is to capture attitudes. In all of the following examples, the attitude was determined by human classifiers, while the sentiment was computed by the textblob package for Python.
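
To illustrate the general idea of lexicon-based sentiment scoring, here is a small R sketch using the syuzhet package. The original analysis used Python’s textblob, so this is only an illustration of the kind of scoring involved, not a reproduction of it.

    # Both of the hypothetical tweets from above contain the negative word "hate",
    # so a lexicon-based scorer rates them as negative sentiment even though they
    # express opposite attitudes toward feminism.
    library(syuzhet)

    tweets <- c("Man, do I ever hate feminists.",
                "I hate that my mom does not like the word 'feminism'.")
    get_sentiment(tweets, method = "afinn")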

Examples of anti-feminist attitudes expressed using positive sentiment

Any men shedding a tear at the end of #page3 , make sure you bottle them. #Feminists get thirsty….

I liked a @YouTube video http://t.co/r6xWBbR1ER Re: Potty-Mouthed Baby Feminists

Feminist article: don’t call me “working mommy” because we don’t call men “working daddy.” So? My MOM title is my most esteemed title. ~Mom

@DoeringNorman @Oneiorosgrip let’s be accurate here norm- when dealing with feminists- polite or not if you disagree,you are their enemy

Wait.. So the feminists siding with amber rose and slandering the lil kardashian for being in a relationship with a pedophile?

Slate Sex Trafficking Post Proves Feminists Can’t Be Happy Even When You Agree With Them http://t.co/VvbbiR9Vk3 #tcot

Feminist gon kill Chuck lmao

@officialncc1701 ALRIGHT I GUESS COULD BE BETTER IF THEY WOULDNT SPOIL MY TV SERIES AND NOT ARGUE ABOUT FEMINISM WITH ME

@Sepko01 @kav_p Someone needs a pat on the shoulder. “It’s ok, bud. Just sit still and the feminists can’t see you.”

@elliottkrista @HisFeministMama The more you dumbass Feminists fight amongst yourselves, the more the MRAs laugh at you. Don’t you get that?

I just realized, the way that the anti-gamers and feminists promote Towerfall is a perfect allegory to their crumbling narrative. #GamerGate

@ONLINEPIXI3 but the “feminists” will love it.

I’m sure that over 100 feminists have me blocked.

#ThankYouEllenPao you have proven that women and minorities can be in power and can achieve amazing things and didn’t need #feminism for it.

Lol see these women are trolling, no other explanation “@Ntiana_: So real quick does feminism throw culture out the window ?   Yes or No”

So many otherwise intelligent people completely lose their minds when it comes to feminism. So many parallels between that and religion.

@Ashykneesirwin Okay so if you don’t feel ashamed, why do you need feminism? To stop men from making jokes about women?

be peddled in the name of feminism. I am glad we all could see those nuances. However, where is this vigilant mass when it comes to 3/n

twitter feminism in a nutshell lol http://t.co/8g3MhdOGEq

So many feminists on twitter so keen on protecting the rights of people who can pop out babies they think it’s OK to trample my boundaries.

@RaasAFC @Faisal_RedDevil Tbf what we said was true. And we shut her up. Feminism is a disgrace, mostly so for women like yourself.

too many attention whores on instagram all in the name of feminism

we need meninism more than feminism because men are the ones who found this world so ☕️

@Niki8954 I admire your optimism, but in my experience, feminism is about convincing women that they’re more oppressed than they are.

Examples of pro-feminist attitudes expressed using negative sentiment:

‘Isn’t feminism when u hate females’ – my brother, 2015

@siIentwords @honestfandom HeForShe is about gender equality/feminism. He wouldn’t have gone if he was against it. What’s so wrong with him+

It’s sad. People are so arrogant and insecure that they think feminism is about not needing a man.

Why would you date a boy who thinks feminism is stupid

@jessepurtell and you should not say you hate all feminists because one video you saw of a women being disrespectful.

Real #feminist will show support for @RealCytherea. This Violent Gang Rape I’d atrocious.  http://t.co/5nnhEr3HUt via @returnofkings

I don’t understand why feminists are being ignored or hated upon, they’re fighting for something that has been ignored for decades????

@BigFashionista @Steve___Miller I hate when anyone does this! Feminism is NOT the root of all evil. Discrimination is!

@PrisonPlanet Snowden still works for Booz/CIA, if you weren’t so obsessed with bashing feminism and Islam you might realize that, dumbass.

@drunkonjbxcbxag @tropicaljustxn @Kayfeminist @Grxnde__Butera she has to be a troll account trying to be a feminist but insulting females+

@Eyebrows_13 ctfu I explained what feminism was and she was like “but all the women I see are angry at men, don’t shave, are spelly hippies”

The lesson from this is that a sentiment analysis that is too simplistic does not provide reliable information about attitudes toward a certain topic.

How MathWorks hurts research: A review of Matlab Distributed Computing Server

TLDR: MathWorks uses a cumbersome and error-prone DRM called MDCS to enforce the restrictions on their Parallel Computing Toolbox software for Matlab. The MDCS product is an abuse of the vendor lock-in that occurs when researchers invest years developing Matlab-based research code in an environment with gratis access to Matlab supplied by institutional licenses, and are subsequently squeezed for additional licensing fees for their (ostensibly) already licensed product when they need to run their code in parallel at scale. Research is harmed by lost time and/or unanticipated costs incurred directly because of MathWorks’ aggressive and unethical licensing strategies for MDCS.


Introduction

Imagine that you are a young researcher who invests a great deal of time learning Matlab and creating Matlab codes for your research. The costs of using Matlab are covered by an institutional license provided by your university or company, so you don’t have to think about them. As you grow as a researcher, you tackle harder and harder computational problems that require greater and greater parallel computational power. After working very hard to develop your parallel code using Matlab’s Parallel Computing Toolbox software, you are ready to tackle your problem at scale: 20 cores, 100 cores, 1000 cores, 10,000 cores, depending on the hardware available to you, the nature of your research question, and the capabilities of your competitors. But you encounter a problem that you didn’t expect: your institution has not paid to run parallel Matlab code at large scale and is unwilling to purchase the necessary licenses. You discover that you must pay unexpected costs to keep doing research with Matlab, or go through the tremendous effort of porting your code to another technology.

The issue is MathWorks’ attitude toward distributed computing. They believe that the cost of a software license should increase with the high-performance computing capacity it unlocks. They have opted to go the route of per-worker licensing fees and technological restrictions on their product, both of which hurt the researchers who use their products and the research they produce. While MathWorks may have identified an opportunity to create revenue by selling greater computational capacity, the additional costs and inconvenience to researchers do not reflect any additional development effort or cost to MathWorks. Research is pointlessly held hostage by MathWorks’ restrictive licensing policies.

I hope that this product review of Matlab Distributed Computing Server (MDCS) can be useful to everyone who makes decisions related to research computing software, both individual researchers who choose their development platforms and those responsible for institutional licenses. The issues with MDCS should not only be considered by people currently considering purchasing or using MDCS, but by any Matlab user who believes that their Matlab code may someday be used on a distributed computing platform. Moreover, many of the problems that I point out with MDCS are not unique to MDCS or Matlab, and some are fairly common and accepted software licensing practices that should nevertheless be questioned and criticised as they can have important impacts on research computing.


Who is paying for Matlab?

In my opinion, one of the core issues that has led to this problem is the question of who pays for Matlab software licenses. In some cases, this cost is paid directly by the researchers who work with Matlab. However, in many (possibly most) cases, users are covered under various types of institutional licenses paid for by universities, colleges, businesses and other organizations. The result is that the people who are responsible for the research are often unaware of the costs of choosing to do that research with Matlab. Many institutions do not make the details of the license or the costs involved known to the researchers. From the researchers’ perspectives, the costs are covered and their use of Matlab is gratis. Often, the institutions negotiate licenses that contain many toolboxes and sometimes unlimited simultaneous licenses for Matlab products. For researchers, this can create a harmful impression of freedom and obfuscate the costs and restrictions associated with using MathWorks products.

And so, students and young researchers invest their educational efforts into gaining Matlab skills and their research efforts into developing Matlab codes, unaware that they are doing something very expensive that may ultimately restrict their capacity to do research. Naturally, MathWorks encourages researchers to take advantage of these institutional licenses by promoting themselves on University campuses, and organizing free training/marketing workshops for their products. The result is a large number of research careers and research programs that are worryingly dependent on MathWorks products. So, there is a vendor lock-in problem. And then MathWorks abuses this lock-in with MDCS.


Licensing options for parallel Matlab

Most computational research programs grow in scope and eventually evolve to a point where the problems become too large to solve with serial code, and parallel computation capabilities become necessary. So, what options are available for a Matlab user to address this reality?

The first step is to invest in Matlab’s Parallel Computing Toolbox. This toolbox is currently listed on MathWorks’ website for $1000 on top of your existing $2150 standard individual Matlab license. The parallel toolbox gives you easy-to-use, high-level constructs for unleashing parallelism in your Matlab programs. This is a good product if you don’t mind the price tag and the proprietary restrictions. Purchasing the toolbox does increase Matlab’s capabilities beyond the standard version, and the toolbox represents real development effort on MathWorks’ part to create the high-level parallel constructs and tools.

The drawback of the Parallel Computing Toolbox is that it contains digital restrictions that only permit it to operate with 12 cores or fewer on a single computer.

The reason for this limitation is not related to the technical implementation of the parallel toolbox. Currently, MathWorks uses an MPI distribution called MPICH2 (version 1.4.1p1). MPI (Message Passing Interface) is an open standard for distributed, parallel computing, and MPICH2 is an open source implementation of that standard which is released under a license which permits its (gratis) use for commercial, closed-source software products like Matlab. Matlab uses MPICH2 to implement the parallelism in the parallel toolbox, and this free software is designed with scalability on multi-node computer clusters in mind.

The reason for the technical limitation of 12 cores in the Matlab Parallel Computing Toolbox is a business decision by MathWorks to charge licensing fees based on the amount of parallel computing capability their users wish to unlock. They could have chosen other ways to inconvenience users into paying additional fees: restrictions based on maximum memory, maximum lines of source code, or maximum number of variables used. A maximum number of workers is just as arbitrary, but is somehow easier to market in a culture where we have been trained to confuse software with private material goods that can’t be trivially copied as many times as we need.

MDCS does not really add functionality to Matlab so much as it partially removes technical limitations in the Parallel Computing Toolbox software. The Matlab Parallel Computing Toolbox (a pre-requisite product for using MDCS) contains the technology to provide scalable parallel computing capability to as many nodes as one can afford to put into a cluster, but this capability is intentionally restricted with technical measures. The design, and the purpose of MDCS is to limit the parallel scalability of Matlab software, not to provide it (MPICH2, the free software that it is built on, provides it). Therefore, I refer to MDCS as digital restrictions management (DRM) software.

The costs of MDCS licenses are not publicly available on the MathWorks website and are only available by requesting a customized quote. To give a ballpark idea, licensing for hundreds of workers (i.e. a medium-sized computer cluster) will cost tens of thousands of dollars, with thousands of dollars of ongoing annual fees.


Is MDCS good software?

Unfortunately, in designing a product whose primary design goal is the enforcement of licensing restrictions, MathWorks has compromised their product’s usability and introduced a number of unnecessary and frustrating points of failure. The product that researchers deserve from MathWorks is a Parallel Computing Toolbox with no licensing or technical restrictions on the number of nodes or workers that can be used, so that Matlab users simply have a high-level interface to the MPI libraries. The MDCS product falls short of this hypothetical product not only in having restrictions on the number of workers that can be used, but also in terms of usability and stability.

With MDCS, you do not log in to the HPC cluster and submit your jobs as you would with other software. Instead, you open Matlab on your laptop or desktop computer, read the HPC site’s documentation on how to configure your computer to submit to the cluster, download integration scripts and install them onto your computer, set up metadata folders both on your computer and on the cluster’s file system, and finally you are able to submit jobs from Matlab, provided that you did not make any mistakes. This is a poor software model for an HPC environment because the configuration, job logs, and metadata used by the software are distributed between two different systems controlled by two different groups of people (the HPC system and the submission system). This model not only increases the number of points of failure, but also means that when a problem arises, investigating and solving it requires coordinated effort between multiple parties. MDCS results in a significant overhead in labour to operate compared to a hypothetical Matlab product with an unrestricted Parallel Computing Toolbox.

MDCS uses metadata about the submitted jobs that is stored both on the cluster’s file system and on the user’s personal computer. This metadata must be synchronized between the two different systems. If the user wants to submit jobs from a different computer, or from a different version of Matlab on the same computer, this can cause corruption of the metadata. If the user submits jobs to a second MDCS system, the user must carefully manage two separate sets of integration scripts, and must also be careful to avoid corruption of the metadata. The corrupted metadata does not produce straightforward errors, but rather strange behaviours and the presentation of misinformation to the user, and it is not always obvious that a problem has occurred. Some of these issues can be alleviated by taking special care to set up separate metadata folders for each combination of computer, Matlab version, and cluster that the user wishes to use. To switch between different sets of metadata folders, the user has to modify information in two places: the integration scripts for the target cluster, and the cluster profile. These metadata issues are not yet documented by MathWorks, and it is up to users to discover them by trial and error.

MDCS is also difficult to maintain from the HPC staff’s perspective, compared to ‘normal’ software. To modify some aspects of the software configuration on the cluster, simultaneous changes must be made to the MDCS configuration on the HPC system as well as to the integration scripts living on the separate hard drives of many users. It can therefore be very difficult for HPC staff to deliver a reliable experience to researchers, because improving the MDCS configuration may mean breaking the workflow of every user and forcing them to upgrade their local configurations before their jobs can run.

For many users, any type of distributed computing can seem complex and error-prone relative to regular desktop computing. An unrestricted Matlab Parallel Computing Toolbox could be an accessible entry point to distributed computing for many Matlab users, in addition to being a high-productivity research platform. It would not have the problems that I have described above and it would even require less development effort from Mathworks. However, instead of making distributed computing less complex, and less error-prone, MathWorks have done the opposite with MDCS.

The above-described problems with MDCS’s design make it a poor choice for users of other programming languages looking for distributed computing tools, so I assume that these users aren’t the target market for MDCS, given the high quality of available alternatives and the high costs of MDCS. Rather, MDCS is a product designed to extort money from locked-in Matlab users.


What else can researchers do?

There are a few workarounds that researchers have developed to try to do parallel computing without incurring the restrictions of Matlab’s MDCS licensing model. One method is to install an MPI distribution such as MPICH2 onto a computing cluster, and then compile wrappers to the MPI function calls using Matlab’s mex compiler so that they can be called from a Matlab program. Finally, using an institution’s many (possibly unlimited) non-parallel Matlab licenses, one can launch many separate Matlab tasks that have the capability to communicate through MPI.

This technique works, but there are significant drawbacks. Software built this way is unable to use Matlab’s wonderful debugging capabilities on the parallel system. It is also unable to fully use other parallel debugging tools because the debuggers will be unable to see inside of Matlab’s proprietary binaries. Programs using MPI can be very hard to develop, debug, and maintain, and neither Matlab’s tools nor the tools used by MPI developers will work properly on the Matlab+MPI Frankenstein’s monster that has been created. Nevertheless, this is a common solution adopted by frustrated researchers.

Another technique is to use a cluster’s parallel file system to coordinate communication between separate non-parallel Matlab tasks instead of the network. Not only does this have the development, debugging, and maintenance problems discussed above, it also has slow performance because the file system is much slower than the network on a computer cluster.

Since Matlab contains technical measures that limit its parallel scaling capacity, and the only available products for parallel computing are the Parallel Computing Toolbox and MDCS, any creative methods that a researcher might devise for achieving parallelism with Matlab outside of these official products are at risk of being seen by MathWorks as circumvention of a technical limitation. If so, they could argue that these techniques are unlawful under the infamous anti-circumvention provisions of the USA’s Digital Millennium Copyright Act or similar laws in other countries.

Finally, researchers who cannot continue their research because of Matlab licensing restrictions may choose to port their code to another programming language. This means redoing much of the development work that went into creating the research software in a different language that offers more freedom. Porting code can represent an incredible investment of labour that could be invested in research instead, but this effort may be well worth it for the potential freedom it can bring. There is a free software project called GNU Octave which attempts to be compatible with the Matlab programming language, presumably a response to the many ex-Matlab users who wish to compute with freedom while reducing the costs of porting their code to another system.


Conclusion


MathWorks uses a free software product to provide the parallel scaling capacity of the Matlab Parallel Computing Toolbox, and then charges additional fees for the removal of arbitrary restrictions on that capacity through MDCS. The MDCS product is not appealing as a distributed computing platform because it is error-prone and unnecessarily complex, resulting in considerable labour overhead in using it. The market for this product is not researchers looking for a general-purpose distributed computing platform, but researchers who are already locked in to Matlab and for whom it is the only option. The costs of MDCS are very high relative to the costs of Matlab, the Parallel Computing Toolbox, or the free software library that provides its parallel scaling capabilities. However, these costs are not transparent to researchers planning a long-term software project; they are a secret that is only revealed through a customized quote at the time researchers are ready to run their software at larger scales. This makes planning in advance for the costs of distributed computing with Matlab something that is rarely done in practice, forcing researchers to absorb unanticipated expenses late in the software development cycle.

It is a fact of research computing that problem sizes are becoming larger and distributed computing platforms are becoming more and more common and accessible. The marketing for MDCS indicates that you can scale up to cluster computing without the expense and hassle of changing your code. This is a reminder from MathWorks to researchers to think ahead when writing your research software and choose technologies that scale to address the modern (distributed!) reality of research computing. You do not want to get burned by choosing the wrong technology and discovering that you can’t run your code on the same problem sizes as your competitors, or are unable to make use of the new parallel computing hardware that is always becoming available. With Matlab, the software scales, thanks to MPICH2, but the licenses do not. If you are a researcher, you should think ahead and choose a programming environment where the technology scales without the expense and hassle of changing your code, or your license.

Ultimately, I propose that the solution is for researchers and organizations that perform research to divest from software products that arbitrarily restrict the capabilities of research code for want of more and more licensing fees. If there must be licensing fees, they should be tied to things that represent real development effort from the vendors. They should not represent the removal of arbitrary restrictions such as a maximum number of characters in a source file, a maximum number of workers, or anything else that might be dreamed up to inconvenience users into paying more.

From MathWorks, I would like to see the technical restrictions on the Parallel Computing Toolbox lifted so that it can scale with the full power of the free MPI distribution that fuels it. This model would improve the usability and stability of Matlab on distributed computing systems by removing restrictive DRM, and would avoid the damage done to the research efforts of those who have chosen to use Matlab products. It would also reduce development costs on MathWorks’ part because the development of MDCS and the restrictions on the Parallel Computing Toolbox would not be necessary.

I understand that Mathworks and their products exist to make money, and I am not necessarily advocating for reducing the overall costs of parallel computing with Matlab for researchers or institutions, or recommending that MathWorks reduce their total revenue: they can charge a fair license fee for Matlab and an additional fee for the Parallel Computing Toolbox. What I am advocating for is the removal of harmful proprietary restrictions from the parallel toolbox that limit its ability to scale to multiple nodes and many cores, and the abandonment of an unethical business strategy that exploits researchers through vendor lock-in.

How Long Does a PhD Take at UBC?

How long does a PhD take?

The question is critically important to current and prospective graduate students. The answer affects their ability to plan their educations, personal lives, personal finances, and careers. Deciding to do a PhD is a little bit like signing a cell phone contract that says how much you owe every month, but not when you get to stop paying. The nature of research means that the duration of a PhD is very uncertain. But, you would be foolish to start one without first learning about how long they take, at least on average. I was foolish like that.

Graduate programs attract students by competing with each other to offer graduate students appealing funding packages. But, graduate students can find themselves racing against the clock to finish their programs before these attractive offers run out. In a UBC PhD program, students who remain in their program after 4 years will find that they no longer have access to the four year fellowships, tri-council funding, or preferred TA hiring that attracted them to school in the first place. I was foolish like that. Perhaps it is necessary to cut students off so they don’t linger in the cushy student life forever. But, it is important to see if 4 years is a realistic goal that most students are actually able to achieve, or if the 4 year funding cliff is an unrealistic expectation that harms typical students for taking typical amounts of time to finish their program.

It is a very legitimate question to ask of an educational institution. Of course, the nature of research means that the length of a PhD is very uncertain. That’s to be expected. But prospective and current PhD students should have a realistic expectation of how long their degrees typically take. How many people take 4 years? How many take 5 years? How many take 6? Obviously, no one takes more than 6 years; UBC has a rule about that. It turns out that students and professors alike have a lot of misconceptions about the realities. It doesn’t help that UBC treats this information like it’s some kind of big secret.

I don’t know how long PhDs take at UBC. But, I did ask a lot of people, sent a lot of emails, did a lot of reading, analysed some data, and even made Freedom of Information Act requests. After all that, I had a small number of answers that didn’t agree with each other. I also had (I think) a number of people annoyed with me for asking too many questions that (I guess) students aren’t supposed to ask.

Asking around

I was involved for two or three years (depending on how you count) in my department’s prospective-student-wooing open house. If those are any indication, “How long do graduate degrees take?” is asked by a large fraction of prospective students coming to UBC. The journey you are currently reading about started because I was disappointed with the answers they were getting. I heard a lot of answers from professors along the lines of “2 years for Masters and 4 years for PhDs, although, obviously, a few slackers take 5 years or more.” I didn’t think that was quite consistent with reality.

In fact, it isn’t quite true. But, I don’t think the professors are being intentionally deceptive. I think they selectively remember their all-star students who wrap up in 3 years, and then make excuses for why they shouldn’t include people who take a long time in their average. Student X didn’t work hard. Student Y had health problems. Student Z had to work part time to feed her family. Obviously, those aren’t ‘average’ students who should be included in the average. The problem is that the students are just asking, when they should be saying “show me the data.”

Show Me The Data

Eventually, a friend of mine discovered that UBC actually had some graduate student completion data up on the PAIR website. It has since been placed behind a password (thanks to me: read on, dear reader), but here it is. Since this is a story about how hard it is to find out how long PhDs take, I should mention that the data was in the form of an Excel spreadsheet with 31832 entries instead of something sensible like a database. If you ever want to frustrate someone into not looking at large volumes of data, use Excel.

Here is a description of the data from a representative of FOGS:

The spreadsheet you have been looking at represents a study undertaken using a subset of Admission cohorts – that is taking a subset of each admission years new intake and tracking them to completion. The report you are looking at amalgamates the 1995 – 2003 Admission cohorts and presents the outcomes of these students. For the Outcome, students may have Graduated, Left UBC, Transferred to another program, Still be in Program or are Unknown if we don’t know what’s happened to them.

This data would easily answer my question, except that it has apparently been corrupted. Among lots of other information, it contains three dates for each student: program start date, program end date, and graduation date. The amount of time a student spends in their program is the program end date minus the program start date. I noticed something funny about this quantity: about 30% of PhD students in the represented cohorts completed their programs in 2 years or less, and not a single person in those cohorts completed their program between 2 years + 1 day and 3.5 years. About 20% of all students who graduated finished in exactly 2 years to the day. There’s a very suspicious gap in the data between 2 years and 3.5 years. I asked FOGS about it. They responded:

[The students who finish in 2 years or less are] likely those that have come from other institutions to complete their studies at UBC or in very structured programs – but the majority graduate in the 4-7 year range (400; 559; 469; 246).

… If you break the TIP Years by CIP Division (the Statscan area of program study) you’ll see that the earlier graduants tend to be clustered in Education and Social Sciences. Social Sciences includes Psychology, a Department that streams students from Masters thru Doctoral levels which helps them complete in a timely (often quicker) fashion…

…It makes sense that very few students would complete in the 3rd year. The earlier 0-2 group are likely either transfers or in uniquely structured programs (ie: Education) where completion may occur more quickly than anticipated. The average time to completion (graduation) is just over 5 years.

Okay, it all makes sense now. It’s Education and Psychology students. Except it’s not. Here’s the breakdown for the PhD students who finished their programs in 2 years or less:
10 Business & Management
129 Education
160 Engineering
75 Health Sciences
23 Humanities
13 Professional
182 Science
96 Social Science

487 Domestic
195 International

Only 12 transferred from other institutions

This group of students is pretty statistically typical for UBC as a whole. They aren’t more likely to be education students than engineers or scientists or anything else.

The students aren’t demographically unusual in any way, but there is something highly unusual about them as a group: They wait an incredibly long time to graduate after their programs end. Sometimes, they wait almost 10 years. Here is a plot where the vertical axis is how long the student waited to graduate after the end of their program ([Grad date] minus [program end date]) and the horizontal axis is time in program ([program end date] minus [program start date]).

The students who complete their programs in less than two years are looking very conspicuous.

Here is the histogram of the two groups of students. Homework: Use your favourite statistical technique to discover how likely it is they were sampled from the same distribution.

Use your favourite statistical test to discover whether these two groups of students are sampled from the same distribution.
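
If you want to try the homework, here is one possible approach in R: a two-sample Kolmogorov–Smirnov test comparing the post-program waits of the two groups. The column names (prog_start, prog_end, grad_date) are hypothetical stand-ins for the dates in the spreadsheet.

    # Compare the "wait to graduate" distributions of the two groups of students.
    time_in_program <- as.numeric(phd$prog_end  - phd$prog_start) / 365.25
    wait_to_grad    <- as.numeric(phd$grad_date - phd$prog_end)   / 365.25

    short_group <- wait_to_grad[time_in_program <= 2]   # finished in 2 years or less
    long_group  <- wait_to_grad[time_in_program  > 2]

    ks.test(short_group, long_group)   # two-sample Kolmogorov-Smirnov test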

I showed these plots to the above mentioned representative of FOGS and the response from UBC was to not answer me and place the data behind a password immediately so that students could no longer get at it. (No worries, I put it here on the web.)

Freedom of Information

I had hit a roadblock with FOGS and still didn’t know how long PhDs took. But UBC had a report from 2010 with the title “Graduate Student Completion Rates and Times.” Perfect. I asked for it along with another report. Apparently, it is top secret:

I’ve talked with the Dean’s Assistant and neither reports are available for public readings as they contain privileged information.

Of course, when someone from the public requests personal or top secret information, employees of public institutions are required by the Freedom of Information and Protection of Privacy Act to inform them that an FOI request must be made to obtain the information. But, I guess they forgot. I made one anyway.

[Spoiler: It will turn out that neither report was top secret.]

FOI Requests and responses:
Request sent March 30, deadline May 17
1. Graduate Student Completion Rates and Times, March 2010, including all appendices

  • Report was given June 5.

2. UBC Faculty of Graduate Studies 2011 External Review Self-Study, including all appendices

  • Report available on the web. Appendices delivered on July 9th, except appendix 25, disclosure of which may have harmed the business interests of a third party.

3. Any draft versions of the report Graduate Student Completion Rates and Times, March 2010

4. Any raw data used in preparing the report Graduate Student Completion Rates and Times, March 2010, in machine-readable format such as a spreadsheet.

  • “We were informed that draft versions of this report no longer exist in electronic or hard copy formats, and that the database containing the raw data for this report (drafts and final copy) is no longer available. Recreating that compiled data…would be extremely time consuming and would unreasonably interfere with our operations.”

[UBC: If you lost the data, don’t even fret about it. You can just download it again here.]

5. All records related to the CUPE 2278 union’s bargaining to extend preferred hiring for TAs. Date range is January 1, 2012 to March 30, 2012.

  • 30 day extension requested on June 5
  • 535 pages of responsive emails were identified, mostly emails between UBC staff. These were withheld in their entirety because all of them contained advice or recommendations, legal advice, information that would harm the right to personal privacy, or information that would harm the financial or economic interests of a public body. Fair enough. I didn’t get them on July 9.

6. All records related to graduate student completion rates and times in the date range January 1, 2012 to March 30, 2012

  • “We have been informed that creating these records would involve compiling and analysing data, which would be extremely time consuming and would unreasonably interfere with our operations” June 5

All that work and all I got was some lousy appendices. Well, I also got the report that I wanted. The only really useful thing about it is that it is full of interesting plots like this:

How many students graduated by a certain time in program. The colours are different groups of students, but you’ll have to guess what they are because the report doesn’t say.

Because FOGS lost the data used to compile this report, I have no way of knowing if it is affected by the problems I discovered above.

So, How Long Do PhDs Take?

It depends who you ask. It may be 4 or 5 years if you ask the professor-on-the-street. It may be just over 5 years on average if you ask FOGS. But, I’m writing this, so I guess we are asking me (i.e. I’m asking me).

I don’t trust FOGS’s data for the reasons outlined above. I believe that there is a problem with the program end date column. Dates in this column may be getting “toggled” prematurely (at or before 2 years) for about 30% of students in the cohort. I don’t know the reason, but the evidence is in the plots above. We can estimate the average completion time by avoiding this column for the corrupted group of students. [Grad date] minus [Start date] is one way, but it doesn’t account for the months that students spend after they finish waiting for a graduation ceremony to happen. We can account for this time by subtracting off the average of ([Grad date] minus [Program end date]) for the uncorrupted group (students who took more than 2 years to graduate). Here is the histogram compared to the one for (Program end date – program start date):

Top: Time in program histogram as prescribed by FOGS
Bottom: Time in program histogram as prescribed by the author
The mean is about 6.5 months greater in the bottom histogram

The average time in program according to this is about 5.5 years. This is about 6.5 months longer than the bizarre-looking FOGS data would claim. You are welcome to believe FOGS instead of me, but it could cost you about $700 in tuition and a few months of your youth. I was foolish like that. Keep in mind that there are well-known variations according to what type of program students are in. By the way, programs where students finish quickly are correlated with programs where students are well funded. I’ll let you think about that.
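
Concretely, the adjustment described above can be sketched in R as follows, reusing the hypothetical date columns from the earlier sketch.

    # Avoid the suspect program end date: take [Grad date] - [Start date], then
    # subtract the typical post-program wait estimated from the uncorrupted group.
    typical_wait  <- mean(wait_to_grad[time_in_program > 2], na.rm = TRUE)
    adjusted_time <- as.numeric(phd$grad_date - phd$prog_start) / 365.25 - typical_wait

    hist(adjusted_time, breaks = 30,
         main = "Estimated time in program (adjusted)", xlab = "Years")
    mean(adjusted_time, na.rm = TRUE)   # about 5.5 years according to the text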

Conclusion

Is a four year funding model reasonable? At UBC, 78% of PhD students need more time. Is a 6 year hard limit on PhDs reasonable? At UBC, 27% of students need to write a letter explaining why circumstances beyond their control resulted in them being unable to complete their degree in a timely manner. Those percentages only include the students who actually graduate. As we can see in FOGS’s data, many more end up leaving for various reasons.

I have the following requests for my various audiences:

Professors and Staff: Please download the graduation data, analyse the outcomes of students who graduated from your program and post the results where current and prospective students can get it. Then, everyone will have answers tailored to their department.

Graduate students: Please ask for the data when you ask “how long will my graduate degree take?”. Don’t let anyone tell you the answer in words because they really don’t know. Definitely don’t believe any program length indicators that involve someone giving you money (scholarship lengths, preferred TA hiring, etc.).

UBC Representatives: Please return the graduation data to the people of BC. I’m really sorry that I plotted it. In my defence, you trained me for way too many years to compulsively plot things. Will you forgive me? Funny-looking data is much better than none at all, and people would rather get it from you than from me. Keep new data flowing all the time. This information affects people’s lives. Also, if you have any answers to the questions I’ve raised, please join the conversation in the comments below.

Teaching Computational Physics

Starting in September, I will be teaching computational physics to first-year undergraduate students. Well, that's the plan, at least.

Mark Guzdial has posted a sobering series of blog posts that highlight how difficult achieving that plan is going to be. Mark’s blog posts summarize PhD research performed by Georgia Tech graduate Danny Caballero.

We experimented with a computational component in this course last year as one aspect of a series of major reforms. Our goal was to give a small taste of how computer programming can be used in physics problem solving, because this idea is emphasized in their textbook. However, we didn't want to get distracted from our main goal: teaching physics. We provided the students with working vPython programs and problems that could be solved by making minor, obvious changes to certain lines of code or to initial conditions. The feedback we collected from the students indicated that no one was satisfied with this treatment: most students were not interested in programming and were frustrated at being forced to work with code they didn't understand, while the students who did have an interest in programming were frustrated at the lame, unchallenging activities they were presented with.
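For context, here is a minimal sketch of the kind of vPython program we handed out. It is not the actual course material, just an illustration under the assumption that the "activity" amounted to tweaking an initial condition (here, the launch velocity) and re-running the simulation.

    from vpython import sphere, vector, rate, color

    # A projectile under constant gravity, integrated with simple Euler steps.
    ball = sphere(pos=vector(0, 0.1, 0), radius=0.2, color=color.red, make_trail=True)
    ball.velocity = vector(5, 8, 0)   # initial condition students were asked to vary
    g = vector(0, -9.8, 0)            # gravitational acceleration (m/s^2)
    dt = 0.01                         # time step (s)

    while ball.pos.y >= 0:
        rate(100)                                 # limit the animation speed
        ball.velocity = ball.velocity + g * dt    # update velocity (Euler step)
        ball.pos = ball.pos + ball.velocity * dt  # update position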

This year, we have decided to make a commitment to teaching programming, giving more challenging tasks that will satisfy the students who are interested in learning programming, while providing proper instruction in programming ideas so that (hopefully) no one feels the code they are working with is mysterious or scary.

In the first blog post, Mark explains why our students are less likely to learn the physics content of the course. Basically, by spending time on teaching programming, we have less time to spend on the physics, and the physics naturally suffers.

Of course, the main goal of the course is to teach physics, but it is a mistake to think that should be the only goal. It is especially a mistake to think that the point of the course is to achieve high student performance on the Force Concept Inventory. This course is an enriched physics course where most of the students are expected to go on to major in physics, where they will take many more physics courses. This is very different from most first-year physics courses, which, for most students, are the last physics instruction they will ever receive. In those courses, it is reasonable to be concerned that your students have not become Newtonian thinkers. But a student who majors in physics will have Newtonian ideas reinforced over and over for several years, so there is perhaps less pressure to ensure they are thinking Newtonianly at the end of their first course.

What is important for these students, in the context of their entire degree, is that they become well trained in all three pillars of physical reasoning: theory, experiment, and numerical computation. In my own undergraduate experience, I spent many hours each week developing myself as a theorist and many hours each week in the lab developing myself as an experimenter, throughout the entire duration of my degree. Numerical computation was relegated to a single course in a single semester, taught by the math department, without any explanation of how numerical analysis might be used by physicists. We would like to fix this problem and instill in students the idea that being able to solve problems by programming a computer is as important to their training as physicists as being able to solve problems analytically or make careful experimental measurements. Like analytical problem solving and experimental methods, numerical techniques should be introduced early and reinforced throughout the entire duration of the degree program.

So, it is good to teach numerics to first year students, even if it temporarily sets back their education in physics concepts like Newtonian mechanics. The important thing is finding the right balance, and using the time you spend well.

The third blog post in the series has even more troubling news for us. Caballero developed an instrument to assess students' attitudes toward computational modelling. This instrument is a series of questions which probe various aspects of those attitudes, and the student answers are compared with answers given by experts in computational modelling. So, it produces a measure of how "expert-like" the students have become over the course of instruction. Unfortunately, the results from this instrument are negative.

In particular, after instruction students had less personal interest in computational modelling, agreed less with the importance of sense-making, and agreed more with the importance of rote memorization.

These results remind me of how students' attitudes toward physics and science often shift to become less expert-like after most physics courses. Learning is shaped like a U, with an initial decline in performance as knowledge gets restructured.

If some negative progress early on is a normal part of learning computational physics, perhaps we shouldn't be too concerned, given that this isn't the only course they will take. After all, the grand plan is to keep developing each student's ability at computational physics regularly over the course of 4 years. So, if the bottom of the U occurs after the first semester, there is still plenty of time to get the students to climb the other side of the U over the next three years of computational physics instruction. By the end of that, they should be more expert-like than when they started school, even if they experience a dip somewhere in the middle…at least according to the U-shaped learning idea. If most of the students in a course will never take another physics course, leaving them at the bottom of the U is probably a bad idea. But with students who are majors, the final state of the student brain is what it looks like at the end of the entire degree, not at the end of the first semester of instruction.

Unfortunately, it may be a long time before we can really start thinking about the efficacy of entire degree programs in a truly evidence-based way. What PhD student is going to want to research a whole 4-year degree for their thesis project?! For the moment, it is very frustrating that, in trying to move forward, you have actually taken a step backward and have to look for excuses that make you feel good about knowingly taking that step backward.

Now, I’m going to take a close look at the types of errors made by the students Caballero worked with and figure out how they might inform our computational physics instruction. I’ll post more as the brilliant insights come to me.

The Big Steps of Evolution

We can show evolution taking small steps in the lab. But evidence for large evolutionary steps is unfortunately a bit rarer. This leads some non-experts to deny that small evolutionary steps can compound into large changes over time. So, it's interesting to explore some of the important evolutionary milestones in the lab. One of those milestones is the transition from single-celled organisms to multicellular organisms. New Scientist is reporting that this step has been observed with yeast in the lab.

Yeast clumps

Sure enough, within 60 days – about 350 generations – every one of their 10 culture lines had evolved a clumped, “snowflake” form. Crucially, the snowflakes formed not from unrelated cells banding together but from cells that remained connected to one another after division, so that all the cells in a snowflake were genetically identical relatives. This relatedness provides the conditions necessary for individual cells to cooperate for the good of the whole snowflake.

After a few hundred further generations of selection, the snowflakes also began to show a rudimentary division of labour.