Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball

We propose the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression. This convex relaxation of reduced-rank multinomial regression has the advantage of leveraging underlying structure among the response categories to make better predictions. We apply our method, nuclear penalized multinomial regression (NPMR), to Major League Baseball play-by-play data to predict outcome probabilities based on batter-pitcher matchups. The interpretation of the results meshes well with subject-area expertise and also suggests a novel understanding of what differentiates players.

Source: [1706.10272] Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball
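For anyone wondering what swapping ridge for a nuclear norm actually changes, here is a minimal sketch in Python with NumPy. It is not the paper's code. The function names (`ridge_penalty`, `nuclear_norm`, `prox_nuclear`) and the toy coefficient matrix are made up for illustration. Singular-value soft-thresholding is shown only as a common way to handle a nuclear norm penalty. The abstract does not say which solver the authors use.

```python
import numpy as np

def ridge_penalty(B):
    """Ridge penalty: sum of squared coefficients (squared Frobenius norm)."""
    return float(np.sum(B ** 2))

def nuclear_norm(B):
    """Nuclear norm: sum of the singular values of B."""
    return float(np.linalg.svd(B, compute_uv=False).sum())

def prox_nuclear(B, t):
    """Soft-threshold the singular values of B by t.

    This is the proximal operator of t * nuclear_norm; singular values
    at or below t are set to zero, which lowers the rank of the result.
    """
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - t, 0.0)
    return (U * s_shrunk) @ Vt

# Hypothetical coefficient matrix: 2 predictors x 2 outcome categories.
B = np.diag([3.0, 1.0])
B_shrunk = prox_nuclear(B, 1.5)

rank_before = np.linalg.matrix_rank(B)
rank_after = np.linalg.matrix_rank(B_shrunk)
```

The ridge penalty shrinks every coefficient but leaves the matrix full rank. Thresholding the singular values instead zeroes out the weaker directions entirely, so the fitted coefficients for different outcome categories end up sharing a small number of underlying dimensions. That is the "underlying structure among the response categories" the abstract refers to.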

Inside Major League Baseball’s “Hypothesis Machine”

Baseball data, over 95% of which has been created over the last five years, will continue to mount—leading MLB decision-makers to invest in more powerful analytics tools. While there are plenty of business intelligence and database options, teams are now looking to supercomputing—or at least, the spawn of HPC—to help them gain the competitive edge.

via Inside Major League Baseball’s “Hypothesis Machine”.

Please. The problem with current baseball analytics isn’t the deluge of data; it’s the deluge of crackpot theories that add more and more irrelevant variables to the mix. Most baseball analytics misuse mathematics and are created by people who are simply selling a website.

Speaking of selling a website, is this a good place to introduce the sister site to bucktownbell.com?  🙂

baseball.brandylion.com

All data in the above data model was crunched using perl, awk, and bash on a standard PC. Baseball is not so complicated that it requires a supercomputer to crunch historical or current-season data. More from the article…

He explained that what teams, just like governments and drug development researchers, are looking for is a “hypothesis machine” that will allow them to integrate multiple, deep data wells and pose several questions against the same data.