We propose the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression. This convex relaxation of reduced-rank multinomial regression has the advantage of leveraging underlying structure among the response categories to make better predictions. We apply our method, nuclear penalized multinomial regression (NPMR), to Major League Baseball play-by-play data to predict outcome probabilities based on batter-pitcher matchups. The interpretation of the results meshes well with subject-area expertise and also suggests a novel understanding of what differentiates players.
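To make the penalty in the abstract concrete: the nuclear norm of a coefficient matrix is the sum of its singular values, and shrinking those singular values toward zero is what encourages a low-rank solution. The sketch below is not the authors' NPMR implementation. It only illustrates the nuclear norm and its standard proximal step (singular value soft-thresholding) using NumPy. The function names, the toy matrix `B`, and the threshold `lam` are assumptions made for illustration.

```python
import numpy as np


def nuclear_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()


def svd_soft_threshold(B, lam):
    """Proximal operator of lam * nuclear norm: shrink each singular value by lam."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # singular values below lam become zero
    return (U * s_shrunk) @ Vt           # rebuild the matrix from the shrunken values


# Toy coefficient matrix (hypothetical): singular values 3 and 1.
B = np.diag([3.0, 1.0])
shrunk = svd_soft_threshold(B, lam=1.5)

print(nuclear_norm(B))                # nuclear norm before shrinkage
print(np.linalg.matrix_rank(shrunk))  # rank after shrinkage
```

In this toy case the smaller singular value falls below the threshold and is zeroed out, so the shrunken matrix has lower rank than the original. The ridge penalty, by contrast, shrinks all entries without driving the rank down in this way.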

# Tag Archives: statistics

# Midwest Start-Up Achieves Rare $1 Billion Valuation

Uptake’s model is to partner with well-known companies in various industries — from construction to mining to aviation — and create software and special algorithms that help these customers collect and understand huge amounts of data. The company is already producing positive cash flow, according to a person with knowledge of the financials who spoke on the condition of anonymity.

Source: *Midwest Start-Up Achieves Rare $1 Billion Valuation – The New York Times*

# Economics Has a Math Problem

Their overview stated that machine learning techniques emphasized causality less than traditional economic statistical techniques, or what’s usually known as econometrics. In other words, machine learning is more about forecasting than about understanding the effects of policy.

That would make the techniques less interesting to many economists, who are usually more concerned with giving policy recommendations than with making forecasts.

# Almost None of the Women in the Ashley Madison Database Ever Used the Site

When you look at the evidence, it’s hard to deny that the overwhelming majority of men using Ashley Madison weren’t having affairs. They were paying for a fantasy.

Source: *Almost None of the Women in the Ashley Madison Database Ever Used the Site*

The question is, how do you find fakes in a sea of data? Answering that becomes more difficult when you consider that even real users of Ashley Madison were probably giving fake information at least some of the time. But wholesale fakery still leaves its traces in the profile data. I spoke with a data scientist who studies populations, who told me to compare the male and female profiles in aggregate, and look for anomalous patterns.

# Statistics Will Crack Your Password

This means that the top 13 unique mask structures make up 50% of the passwords from the sample. Over 20 million passwords in the sample have a structure within the top 13 masks.

via Statistics Will Crack Your Password.

Based on analyzing the data, there are logical factors that help explain how this is possible. When users are asked to provide a password that contains an uppercase letter, over 90% of the time it is put as the first character. When asked to use a digit, most users will put two digits at the end of their password (graduation year perhaps).
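A rough sketch of how a mask structure can be computed and how coverage by the most common masks can be measured. The excerpt does not say which notation the original analysis used, so the `?u`/`?l`/`?d`/`?s` tokens (a hashcat-style convention) are an assumption, as are the function names. The sample passwords are made up for illustration and do not come from the analyzed dataset.

```python
from collections import Counter


def password_mask(password):
    """Map each character to a class token: upper, lower, digit, or other."""
    tokens = []
    for ch in password:
        if ch.isupper():
            tokens.append("?u")
        elif ch.islower():
            tokens.append("?l")
        elif ch.isdigit():
            tokens.append("?d")
        else:
            tokens.append("?s")  # anything else is treated as a symbol
    return "".join(tokens)


def mask_coverage(passwords, top_n):
    """Fraction of passwords whose mask is among the top_n most common masks."""
    counts = Counter(password_mask(p) for p in passwords)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(passwords)


# Made-up sample: three passwords share the "capital first, two digits last" shape.
sample = ["Summer09", "Winter12", "password", "letmein1", "Dragon99"]

print(password_mask("Summer09"))
print(mask_coverage(sample, top_n=1))
```

Applied to a real password dump, the same counting would show how few masks are needed to cover half of the sample, which is the kind of figure quoted above.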

# Fixing Steam’s User Rating Charts

By contrast, the current ranking system leads to the popular becoming more popular — once you’re on the top charts, you have increased visibility, which leads to more reviews, which further cements your chart position (as long as you stay inside your semantic rating bucket).

Those of us who want to discover hidden gems really need the search functionality to work with us, not against us. We want a system where the top charts are self-correcting, rather than self-reinforcing. Otherwise we get a situation like Apple’s with frozen charts, shady tactics, and skyrocketing user acquisition costs.

# Bayesian Prediction for The Winds of Winter

Predictions are made for the number of chapters told from the point of view of each character in the next two novels in George R. R. Martin’s *A Song of Ice and Fire* series by fitting a random effects model to a matrix of point-of-view chapters in the earlier novels using Bayesian methods. **SPOILER WARNING: readers who have not read all five existing novels in the series should not read further, as major plot points will be spoiled.**

via [1409.5830] Bayesian Prediction for The Winds of Winter.

# Instrumental Variables Methods

Estimating causal impacts is fraught with difficulty. Even randomized trials are imperfect, in part because we can seldom, if ever, conduct true experiments (though experimental design is still the gold standard of statistical research). IV is one of the more compelling quasi-experimental methods of estimating impacts, largely because the assumptions needed to justify the IV method are often more plausible than those needed to justify other methods, such as regression.
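A small simulation can show why an instrument helps when a regressor is confounded. The sketch below is not taken from the Urban Institute toolkit. It uses simulated data and the ratio-of-covariances IV estimator, which coincides with two-stage least squares when there is one instrument and one regressor. The variable names, coefficients, and sample size are all assumptions chosen for illustration, and the simulation builds in the key IV assumptions (the instrument `z` affects `x` and is unrelated to the confounder `u`).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated data (all values are illustrative assumptions).
z = rng.normal(size=n)                       # instrument: shifts x, unrelated to u
u = rng.normal(size=n)                       # unobserved confounder
x = z + u + rng.normal(size=n)               # treatment depends on z and u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true causal effect of x on y is 2.0


def ols_slope(x, y):
    """Slope from regressing y on x, ignoring the confounder."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)


def iv_slope(z, x, y):
    """IV estimate: cov(z, y) / cov(z, x)."""
    zc = z - z.mean()
    return (zc @ (y - y.mean())) / (zc @ (x - x.mean()))


ols_estimate = ols_slope(x, y)
iv_estimate = iv_slope(z, x, y)

print(ols_estimate)  # pulled away from 2.0 by the confounder
print(iv_estimate)   # should land close to the true effect of 2.0
```

The regression estimate absorbs the confounder's influence, while the IV estimate uses only the variation in `x` that comes from `z`. In real applications the assumptions about the instrument cannot be verified this cleanly and have to be argued on substantive grounds.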

via The Urban Institute | Toolkit | Data Methods | Instrumental Variables Methods.

# RStudio – About

RStudio provides open source and enterprise-ready professional software for the R statistical computing environment. We started RStudio because we were excited and inspired by R. RStudio products, including RStudio IDE and the web application framework RStudio Shiny, simplify R application creation and web deployment for data scientists and data analysts.

via RStudio – About.