Monday, March 5, 2012

College Baseball Minor League Equivalencies

The following is a paper I wrote in college about college baseball MLEs. Please note that this data was collected before the NCAA bat changes took place.


In 1985 Bill James, the father of sabermetrics and a current Red Sox advisor, introduced a new concept in his annual Baseball Abstract book: a new way to measure and evaluate minor league players’ performances. He came up with the idea of Major League Equivalencies (MLEs), which adjust a player's minor league statistics for external factors such as run environment, park factors, and level of competition. For example, James found that on average a player loses about 18 percent of his offensive production when moving from Triple-A to the Major Leagues. These translations are not predictions of what the player will do in the Major Leagues but rather an indicator of how he has already performed. Bill James believed this to be one of his most important and influential research breakthroughs.[1] Having reliable translations of minor league performance may change how front office decision makers decide whom to acquire or promote. Since James first wrote about MLEs, many others have worked to duplicate and improve his work. There are now translations for Japanese leagues and all levels of the minor leagues. There are also some translations for pitchers, but these results are prone to much greater error than those for hitting stats.
The objective of this research was to take Major League Equivalencies to the next level. Specifically, I wanted to introduce NCAA Division 1 college statistics and build minor league translations in order to determine which statistics correlate with and translate best into the professional ranks. There is simply too much noise and variation for one to translate college stats all the way to the major league level. However, once the numbers are translated to the minor league levels, it is conceivable that they could then be adjusted further using MLEs. I set out to translate college stats into Rookie level equivalencies and Low-A level equivalencies because these are the two levels where most college draft picks end up playing their first year.
The particular statistics I was interested in observing were walk rate (BB%), strikeout rate (K%), isolated power (ISO) and weighted on-base average (wOBA). A player’s walk rate is simply his total number of walks divided by his plate appearances. The strikeout rate is the total number of strikeouts divided by at-bats. Isolated power is a player’s slugging percentage minus his batting average. It is a simple yet effective way to capture a player’s true power. Weighted on-base average is a linear weights statistic designed to measure total offensive performance[2]. It is scaled to on-base percentage, which means that at the professional levels .330 is around average. To estimate wOBA I used the formula (slugging percentage plus 1.75 times on-base percentage) divided by three.
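As a rough illustration, the four statistics defined above could be computed as in the sketch below. The function names and the sample stat line are made up for the example and do not come from the study's data.

```python
def walk_rate(walks, plate_appearances):
    """BB%: walks divided by plate appearances."""
    return walks / plate_appearances


def strikeout_rate(strikeouts, at_bats):
    """K%: strikeouts divided by at-bats (the definition used in this paper)."""
    return strikeouts / at_bats


def isolated_power(slugging, batting_average):
    """ISO: slugging percentage minus batting average."""
    return slugging - batting_average


def estimated_woba(on_base, slugging):
    """wOBA estimate used in this paper: (SLG + 1.75 * OBP) / 3."""
    return (slugging + 1.75 * on_base) / 3


# Hypothetical college stat line, for illustration only.
line = {"PA": 250, "AB": 215, "BB": 30, "SO": 40,
        "AVG": 0.320, "OBP": 0.410, "SLG": 0.520}

print(f"BB%  = {walk_rate(line['BB'], line['PA']):.3f}")
print(f"K%   = {strikeout_rate(line['SO'], line['AB']):.3f}")
print(f"ISO  = {isolated_power(line['SLG'], line['AVG']):.3f}")
print(f"wOBA = {estimated_woba(line['OBP'], line['SLG']):.3f}")
```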
The website www.boydsworld.com has an in-depth database of college baseball statistics which I used to obtain the college numbers.  The first step was to apply park and strength of schedule factors for all players.  To do this, I used a methodology used by writer and researcher Kent Bonham in a series of online articles he wrote.  To apply the park factor I multiplied the three-year weighted Park Factor for each team by the square root of 100 divided by the PF.  Then to apply the strength of schedule rating I multiplied the SOS-number for every team by the square root of the SOS divided by 100.  Basically a Park Factor over 100 implies the team played in favorable hitting environments while a PF under 100 suggests the opposite. Likewise, a strength of schedule rating over 100 indicates the team faced an above-average level of difficulty and under 100 means a weaker schedule was played.  These neutralizing
I then took all college position players drafted in the first 30 rounds of the 2007, 2008, and 2009 drafts who accumulated at least 50 plate appearances both in college and in Rookie level or Low-A ball the same year they were drafted. I decided to use only same-year statistics because this would eliminate noise and other variables such as player improvement in skill, strength, or age. The next step was to equalize the plate appearances, since players rarely accumulate exactly the same number of plate appearances in college and the minor leagues the year they are drafted. To do this one must create a “plate appearance factor” for each player. I took the ratio of each player’s plate appearances in college and the minors for that season and used that factor to weight all statistics according to the lesser number of plate appearances. So if a player had 125 plate appearances in college but only 100 in the minors that same year, I would weight all of his college stats by .8 (100 divided by 125) so that all things were held equal. This ensured that the total plate appearances would be the same for both sample groups.
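The plate appearance factor can be sketched as follows. This is one reading of the step described above, with hypothetical function names and a made-up stat line: whichever sample has more plate appearances is scaled down to match the smaller one.

```python
def pa_factors(college_pa, minors_pa):
    """Return (college_weight, minors_weight) so both samples are scaled
    to the lesser of the two plate appearance totals."""
    lesser = min(college_pa, minors_pa)
    return lesser / college_pa, lesser / minors_pa


def weight_stats(counting_stats, weight):
    """Scale every counting stat in a dict by the same weight."""
    return {name: value * weight for name, value in counting_stats.items()}


# The example from the text: 125 PA in college, 100 PA in the minors.
college_weight, minors_weight = pa_factors(125, 100)

# Hypothetical college counting stats for that player.
college = {"PA": 125, "H": 40, "BB": 15, "SO": 20}
print(college_weight, minors_weight)
print(weight_stats(college, college_weight))
```

After weighting, the college line represents 100 plate appearances, the same as the minor league line, so the two can be summed into group totals on equal footing.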
My Rookie level sample consisted of 222 players and the Low-A sample of 324 players. With samples this large I felt comfortable moving forward with the research. I had originally intended to use five years of draft data, but I noticed that with three years the numbers had already begun to flatten out, and I did not think any more data would be necessary.
To create these factors I first had to sum up all statistics. These statistics included plate appearances, at-bats, hits, singles, doubles, triples, home runs, total bases, slugging percentage, walks, hit by pitch, strikeouts, ground into double plays, on-base percentage, sacrifice flies, sacrifice hits, stolen bases, stolen base success rate, batting average on balls in play, isolated power, walk percentage, strikeout percentage, walk to strikeout rate, and weighted on-base average. This was done with four different data sets: Rookie level stats, NCAA-Rookie stats, Low-A stats, and NCAA-Low-A stats. The totals were added up for each respective draft class, and then the minor league numbers were divided by the collegiate numbers to create the factors for both the Rookie level and the Low-A level (please see attached spreadsheets for specific results).
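A minimal sketch of the factor calculation, assuming each player's stats have already been equalized for plate appearances. The function name and the two-player data set are hypothetical and only show the mechanics of summing each group and dividing minor league totals by college totals.

```python
def level_factors(college_rows, minors_rows, stats):
    """Sum each stat over all players in both groups, then divide the
    minor league total by the college total to get a factor per stat."""
    factors = {}
    for stat in stats:
        college_total = sum(row[stat] for row in college_rows)
        minors_total = sum(row[stat] for row in minors_rows)
        factors[stat] = minors_total / college_total
    return factors


# Hypothetical two-player sample, already weighted to equal PA.
college = [{"PA": 100, "BB": 12, "SO": 15},
           {"PA": 100, "BB": 8, "SO": 10}]
minors = [{"PA": 100, "BB": 10, "SO": 20},
          {"PA": 100, "BB": 6, "SO": 15}]

factors = level_factors(college, minors, ["PA", "BB", "SO"])
print(factors)
```

Because plate appearances are equal in both groups, the PA factor is 1 and the walk factor computed from totals is the same as the ratio of the two groups' walk rates.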
The factors are for a typical player and are not representative of each individual’s skill. Naturally some players will overperform and others will underperform the factors, but they are designed as a guideline for the average Division 1 collegiate player’s transition into the minor leagues. The factors make intuitive sense: players moving from NCAA Division 1 to the Rookie level lose less of their offensive value than players moving to Low-A, because the competition increases at each level of the minor leagues and Rookie leagues are the lowest rung of the minors. According to my factors, Rookie level players lose roughly 16 percent of their offensive value (wOBA factor of .842) while Low-A players’ production is reduced by roughly 27 percent (wOBA factor of .727). Walk rates remain fairly consistent at each level, with a Rookie factor of .879 and a Low-A factor of .866. Strikeout rates increase at both levels, by factors of 1.313 and 1.378 respectively. One interesting observation was the dramatic decrease in power as a player transitions from college to the minor leagues. A typical player will lose roughly 33 percent of his isolated power jumping to the Rookie leagues and about 45 percent going from the NCAA to Low-A.
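To make the factors concrete, here is a small sketch that applies them to a college stat line. The wOBA, BB% and K% factors are the ones reported above; the ISO factors (0.67 and 0.55) are approximations backed out of the "roughly 33 percent" and "about 45 percent" losses, not the exact spreadsheet values. The college line itself is hypothetical.

```python
# Factors reported in the text. ISO values are approximations derived
# from the stated percentage losses, not exact figures from the study.
FACTORS = {
    "Rookie": {"wOBA": 0.842, "BB%": 0.879, "K%": 1.313, "ISO": 0.67},
    "Low-A": {"wOBA": 0.727, "BB%": 0.866, "K%": 1.378, "ISO": 0.55},
}


def translate(college_line, level):
    """Multiply each adjusted college stat by the factor for the level."""
    return {stat: value * FACTORS[level][stat]
            for stat, value in college_line.items()}


# Hypothetical park- and schedule-adjusted college line.
college_line = {"wOBA": 0.450, "BB%": 12.0, "K%": 15.0, "ISO": 0.200}

for level in ("Rookie", "Low-A"):
    translated = translate(college_line, level)
    print(level, {stat: round(value, 3) for stat, value in translated.items()})
```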
In order to test my results I ran regressions of some of my predicted statistics against the actual minor league numbers. I decided to run regressions at the Low-A level for the key statistics I mentioned earlier. To do this I multiplied each player’s college statistics by the respective factor. This gave me my x-variable, the predicted results. I then ran Excel regressions against the players’ actual minor league statistics for that season (please see attached for Excel regressions). The results were encouraging. For strikeout rates I got the equation actual K% = 15.248 + .389(predicted K%). The t-statistic was 8.00 and the p-value was below .05, so the result is statistically significant. The regression equation for walk rates was actual BB% = 5.22 + .489(predicted). The t-statistic was over seven and the p-value was below .05, indicating the result was significant. For ISO the equation was actual = .057 + .516(predicted), and again the values were highly significant. The equation for wOBA was actual = .212 + .200(predicted), and the results were significant (t-statistic of 2.6 and p-value of .009). I have also included charts of the actual versus predicted statistics.
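The regressions were run in Excel, but the same kind of check can be sketched in a few lines of Python. The code below is an ordinary least squares fit of actual on predicted values that also returns the t-statistic of the slope; the function names and the four toy data points are invented for illustration and are not from the study. The last function simply evaluates the K% equation reported above.

```python
import math


def ols_fit(x, y):
    """Simple linear regression of y on x.
    Returns (intercept, slope, t-statistic of the slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / se_slope


def expected_actual_k(predicted_k):
    """Reported Low-A equation: actual K% = 15.248 + .389 * predicted K%."""
    return 15.248 + 0.389 * predicted_k


# Toy data (not from the study): x = predicted, y = actual.
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [2.0, 4.0, 5.0, 8.0]
intercept, slope, t_stat = ols_fit(predicted, actual)
print(f"intercept={intercept:.3f} slope={slope:.3f} t={t_stat:.2f}")
print(f"{expected_actual_k(20.0):.3f}")
```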
There are some causes for concern in this study. First, there is selection bias in separating the samples into Rookie and Low-A pools. It is often the case that better players are drafted and placed in higher levels (usually Low-A, but sometimes Single-A) while weaker picks might go straight to the Rookie leagues. The numbers seem to support this argument, as Rookie players posted an average weighted on-base average of .420 in college while Low-A players averaged 35 points higher, for a cumulative average of .455. This would indicate a gap in talent level between the two groups and may skew the numbers accordingly. Also, by using 30 rounds of draft picks we see wildly varying results at the minor league levels. Typically, higher draft picks will outperform others, especially at the lower levels of the minor league system, because they are more advanced and more talented than their counterparts. Using 30 rounds as a cutoff may cast too wide a net, and a further study may reduce that number. There could also be a problem with using 50 plate appearances as a minimum. Baseball statistics accumulated in only 50 plate appearances are not very reliable and are subject to extremely large confidence intervals because the sample is so small. In the future it may be beneficial to increase the minimum. Finally, some of the NCAA data was missing entries. Not every school tallies sacrifice hits, sacrifice flies, or stolen base attempts, which may alter the numbers somewhat.
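To give a sense of how wide those intervals are, the sketch below computes an approximate 95 percent interval for an on-base-type rate. It rests on simplifying assumptions that are mine, not the study's: every plate appearance is treated as an independent trial with the same success probability, and the normal approximation to the binomial is used. The .350 rate is hypothetical.

```python
import math


def rate_interval_half_width(rate, plate_appearances, z=1.96):
    """Half-width of an approximate 95% interval for a binomial rate,
    using the normal approximation: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(rate * (1 - rate) / plate_appearances)


# Hypothetical .350 on-base rate observed over different sample sizes.
for pa in (50, 200):
    half_width = rate_interval_half_width(0.350, pa)
    print(f"{pa} PA: .350 plus or minus {half_width:.3f}")
```

Under these assumptions the interval at 50 plate appearances is about twice as wide as at 200, since the width shrinks with the square root of the sample size.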
Something that could be interesting for a future study is to include collegiate pitchers. Pitching is, by nature, subject to more variability due to injuries and other uncertainties, and this may be difficult for a researcher to overcome, but I think I’ve laid out a baseline that one could follow. Similar methods could be applied to pitchers to create factors as well.

[1] http://baseballanalysts.com/archives/2004/11/abstracts_from_20.php
[2] http://www.insidethebook.com/woba.shtml

1 comment:

  1. Randomly came across this when searching for NCAA translations. Excellent stuff; keep working at it.
