Winter for this guy has little to do with a functioning CBA. Winter is for research. Winter is for reading. Winter is for Retrosheet.
What is Retrosheet?
Retrosheet is an organization that collects play-by-play accounts over every major league baseball game. They’ll take those accounts, verify them, and use a uniform scoring standard to code them in computer format. They make their efforts available for free.
From their website, https://www.retrosheet.org/about.htm.
“Retrosheet’s work has three distinct aspects. First is the collection of the game accounts, which have been obtained from several sources. The bulk of the collection has come from the major league teams, who have allowed us to make copies of their play-by-play accounts. Several sportswriters (most of them retired) have allowed copies to be made of their daily scorebooks. In addition, individual fans have donated copies of games they scored at the park or on their own scoresheets at home. The second activity is the translation of these accounts to a unified, modern system which is essential since there are an extraordinary variety of scoring systems which have been used. The final activity is the entry of the translated accounts into the computer.”
They have collected, verified, and made available computerized play-by-play accounts of over 170,000 games since 1901. Their data includes play-by-play data for EVERY GAME since 1973. They also have play-by-play data for All-Star games and playoff games.
Their hard work benefits nearly every professional baseball organization. Any place that regularly consumes baseball data sites uses Retrosheet data as a starting point. Baseball-Reference is one example.
They are not limited to play-by-play data.
They have provided a game log for every game since 1871. The game logs will contain team statistics, winning and losing pitchers, attendance, game conditions, line scores, umpires for most of those games. They also have game logs for All-Star games and playoff games.
Additional data sets that are made available include schedules, transactions, ejections, and biographical information about every player.
Their website also includes box scores for every baseball season and may provide detailed game descriptions, such as https://www.retrosheet.org/boxesetc/1920/B08160NYA1920.htm.
INDIANS 5TH: Chapman was hit by a pitch; Chapman tragedy; LUNTE RAN FOR CHAPMAN; Speaker forced Lunte (second to shortstop); Smith struck out; Gardner singled to right [Speaker to third]; O’Neill singled to right [Speaker scored, Gardner to third]; Johnston forced O’Neill (second to shortstop); 1 R, 2 H,
0 E, 2 LOB. Indians 4, Yankees 0.
There is a treasure of baseball facts, history, and trivia waiting for discovery for baseball fans. The hardest part is getting to the point where you can use it. I will focus on the play-by-play data for this article. A future report will include information and usage examples of other data available from Retrosheet.
You Said Something About Play-By-Play Data
For a clear picture of the Retrosheet sheet scoring system, go to the source https://www.retrosheet.org/eventfile.htm. I’ll attempt to summarize. Each game gets an ID. That ID is made up of three letters to determine the home team, then a four-digit year, two-digit month, two-digit day, and then either 0, to represent a single game, 1, to represent game one of a doubleheader, or 2, to represent the second game of a doubleheader. When using the data, you’ll need the game ID if you want to get date information, so later, when you see “substr(game_id,4,4),” I am just getting the year.
There are tags to identify the game’s starters and substitution in the game. There are tags to determine the pitcher and batter, runners on the bases, and players on the field. If the information is available, umpires and their locations get tagged. Attendance figures are labeled if available.
Plays recorded include what happened to the batter and runners. Who fielded the ball on the defense. How were the outs scored? What was the trajectory of the ball? Line drive? Pop Up? Fly Ball? Groundball? There is a way to add additional comments that might not fit precisely into the scoring system. Pitching sequences, strikes, balls, pitchouts get recorded if known. You’ll find events that don’t involve the batter. Anything known about the game will find a place in the event files. And yes! If you want to create your one event file, you can! Want to digitize your little league games? You can!
From the event play-by-play files, you can gather roster information, team information, a list of substitutions, etc. Retrosheet also uses their IDs for players, but the roster information makes it easy to map those IDs to names.
Tools are needed to convert the files into data that you can use. You can write your own, but Retrosheet offers software tools to use their files https://www.retrosheet.org/tools.htm. I recommend using Chadwick https://github.com/chadwickbureau/chadwick. It can take quite a bit of work to generate all the data for each year. I will leave that as an exercise for you if you choose to do it. I have done the hard work for you.
I created a database you can download and use to make things easy. You’re welcome.
Tools You’ll Need and It’s All Free!
But first a word about the tools I will be using. In this article, I will be using simple, freely available tools. There are better ways to do things, but I hope to make using this data easy for people. I have taken the Retrosheet data and converted it for use in a database. While many databases are available, I will use SQLite (https://www.sqlite.org). SQLite is freely available for Windows, Mac, and Linux systems. To use my database, you’ll only be required to download the sqlite3 program. There are better tools and better ways to do things. My hope here is to give people a starting point. To obtain sqlite3 on Windows, go to https://sqlite.org/download.html and download SQLite-tools-win32-x86-3370200.zip or any latest version that may be available. Then unzip them into a directory, say c:\baseball.
If you are not familiar with SQL, I recommend finding a good resource on it. There are many free courses, tutorials, and books available to learn it. While I use SQLite for this article, it uses SQL. If you need help finding a resource, please contact me. I’m here to help!
For my examples, I’ll be using Linux as my operating system. If you use a Mac, you can download SQLite from https://sqlite.org/download.html, or if you use brew, brew install sqlite3 will install it. Need help? I’ll help point you in the right direction.
Next, you’ll need to download my Retrosheet SQLite database available here https://txt.mek.cc/retrosheet/releases/sqlite/ and the current version is https://txt.mek.cc/retrosheet/releases/sqlite/retrosheet-sqlite3_v2022_02_15.zip. This is about a 1Gb download and expands to 5Gb. The current version will always be the latest date. New releases will happen when updated data is available. I always use vYYYY_MM_DD as the version. After you download the zipped database, extract it into a directory. Once again, let’s say c:\baseball.
Note: If you are interested in how I created the database server, that information is here https://github.com/thesiddfinch/retrosheet.
I keep my SQLite binary and my database files in the same directory. I keep the database and a copy of SQLite for Windows, Mac, and Linux on a USB drive.
Now I use Linux, and my prompts will be different than that of Mac and Windows users.
Stand Back, I’m About To Do Stuff
retrosheet=; sqlite3 retrosheet.db SQLite version 3.36.0 2021-06-18 18:36:39 Enter ".help" for usage hints. sqlite> .table events lkup_cd_recorder_method games lkup_cd_recorder_pitches lkup_cd_bases lkup_id_base lkup_cd_battedball lkup_id_home lkup_cd_event lkup_id_last lkup_cd_fld lkup_id_park lkup_cd_h lkup_id_player lkup_cd_hand lkup_id_recorder lkup_cd_park_daynight lkup_id_team lkup_cd_park_field lkup_id_ump lkup_cd_park_precip rosters lkup_cd_park_sky subs lkup_cd_park_wind_direction teams
All the lkup_ tables store information that provides more detailed descriptions for the main tables’ items. The main tables are games, events, teams, subs, and rosters. You can learn more about what is in the table using the .schema command. I’ll expand on them in a future tutorial.
sqlite> .schema teams CREATE TABLE teams ( year_id integer ,team_id varchar(3) ,lg_id varchar(1) ,loc_team_tx varchar(30) ,name_team_tx varchar(30));
The above seems scary if you are new to SQL, but there is a table called teams: year, team, league, team location, and team name.
sqlite> .header on sqlite> .mode column sqlite> select * from teams limit 10;
Note: The output is not nice HTML tables. I’ve created the HTML tables to give a more visually pleasing appearance. Your result will vary but will be plain text.
Generating Some Stats for Players
There is a bit of SQL to select ten rows from the teams’ table. For the focus here, let’s write a short statement to get the 2021 stats for Franmil Reyes in 2021.
First, you need to get the Retrosheet player id.
sqlite> .schema rosters CREATE TABLE rosters ( year_id integer ,team_id varchar(3) ,player_id varchar(8) , last_name_tx varchar 25) ,first_name_tx varchar(25) ,bat_hand_cd varchar(1) ,p it_hand_cd varchar(1) ,team_tx varchar(3) ,pos_tx varchar(5)); sqlite> select * from rosters where last_name_tx = 'Reyes' and first_name_tx = 'Franmil' and year_id = 2021;
The events table has 162 columns, so I won’t print out them here. A partial list of the data available can view here https://www.retrosheet.org/datause.txt.
We have Franmil’s player id, and using that will pull his batting stats. I will output his PA, AB, H, Doubles, Triples, HR, Walks, and Strikeouts.
.header on .mode column select sum(case when pa_new_fl = 'T' then 1 else 0 end) as PA ,sum(case when ab_fl = 'T' then 1 else 0 end) as AB ,sum(case when h_cd > 0 then 1 else 0 end) as H ,sum(case when h_cd = 2 then 1 else 0 end) as DBL ,sum(case when h_cd = 3 then 1 else 0 end) as TRP ,sum(case when h_cd = 4 then 1 else 0 end) as HR ,sum(case when event_cd = 3 then 1 else 0 end) as SO ,sum(case when (event_cd = 14 or event_cd = 15) then 1 else 0 end) as BB from events where substr(game_id,4,4) = '2021' and bat_id = 'reyef001' group by bat_id ;
Don’t be intimidated! If I can figure this out, you can too!
All we do is check if there is a condition. Is this a new PA? What is an AB? If the hit value is greater than 0, there was a hit. That hit was either a single (1), double (2), triple (3), or home run(4). To find out strikeouts, we check the type of event to see if it was a strikeout (3), a walk (14), or an intentional walk (15). If one of the events is confirmed, we output one; otherwise, a zero. We don’t pull all the information, only the information when the year is 2021 (taken from the game id) and the batter (bat_id) is “reyef001”. Instead of spewing all the rows out, SQL will group them, and since we use a sum command, all the ones and zeros get summed together.
The output for the long command is:
Now, with a slight change, let’s do every player’s season stats.
select substr(game_id,4,4) as year ,bat_id as batter ,sum(case when pa_new_fl = 'T' then 1 else 0 end) as PA ,sum(case when ab_fl = 'T' then 1 else 0 end) as AB ,sum(case when h_cd > 0 then 1 else 0 end) as H ,sum(case when h_cd = 2 then 1 else 0 end) as DBL ,sum(case when h_cd = 3 then 1 else 0 end) as TRP ,sum(case when h_cd = 4 then 1 else 0 end) as HR ,sum(case when event_cd = 3 then 1 else 0 end) as SO ,sum(case when (event_cd = 14 or event_cd = 15) then 1 else 0 end) as BB from events group by year ,bat_id ;
We added two new columns, the year and the batter. We told our SQL command to group all the data by our additional columns, year and batter, and sum all the rows. Presto, seasonal stats for every player in the database.
Going back to Reyes again, I want to see how he batted with a runner on third base. We’ll add the run3_dest_id, and record the numbers where run3_dest_id is greater than three. The runX_dest_id value when zero means that nobody was on base. When it is non-zero, it will tell us where the runner on base X ended up (1 for first, 2 for second, 3 for third, 4 for home, 5 for scored and unearned, six scored, and team unearned).
.header on .mode column select sum(case when pa_new_fl = 'T' then 1 else 0 end) as PA ,sum(case when ab_fl = 'T' then 1 else 0 end) as AB ,sum(case when h_cd > 0 then 1 else 0 end) as H ,sum(case when h_cd = 2 then 1 else 0 end) as DBL ,sum(case when h_cd = 3 then 1 else 0 end) as TRP ,sum(case when h_cd = 4 then 1 else 0 end) as HR ,sum(case when event_cd = 3 then 1 else 0 end) as SO ,sum(case when (event_cd = 14 or event_cd = 15) then 1 else 0 end) as BB from events where substr(game_id,4,4) = '2021' and bat_id = 'reyef001' and run3_dest_id > 3 group by bat_id ;
Fun fact, Reyes hit two triples in 2021. Both occurred when somebody was on third base.
For an exercise, you can take the SQL I wrote to generate information for all batters when there is a runner on third base! Get stuck? Ask me.
Oh The Things You’ll Find!
A few tips from me. I have the SQL commands into a file called < Something>.sql. You can then have then run using the SQLite command read. I also add the SQLite command .output to write output to a text file. Note: cat is a command that displays a file to the screen.
retrosheet=; cat 4.sql .header on .mode column .output 4.txt select substr(game_id,4,3) as decade ,inn_ct as inning ,bat_home_id as half ,sum(case when ab_fl = 'T' then 1 else 0 end) as AB ,sum (case when h_cd > 0 then 1 else 0 end) as H ,round((sum(case when h_cd > 0 then 1 else 0 end)*1.0)/(sum (case when ab_fl ='T' then 1 else 0 end)*1.0),3) as BA from events where inn_ct < 3 group by decade ,inning ,half ;
retrosheet=; sqlite3 retrosheet.db SQLite version 3.36.0 2021-06-18 18:36:39 Enter ".help" for usage hints. sqlite> .read 4.sql sqlite> .quit retrosheet=; sed -n 1,30p 4.txt
Here I created a text file, 4.sql, that had the SQL command. I started SQLite and read in the file. The SQL command had an output command. Don’t worry about the sed command, just my fancy way of looking at the first 20 lines of the 4.txt file.
What does this SQL do? It looks at decades, the first three numbers of the year, i.e., 191, and for each half innings records the AB and H, then computes BA. Notice Something odd about the 1910s? The visiting team had a higher batting average than the home team. Yet, the home team has a significantly higher batting average in the other years.
Wait? What! THE DATA IS WRONG! YOU FOOL KOVACH!
Hold your horses! It’s true!
In the 1920s, baseball introduced two significant rule changes. First, they outlawed the spitball except for grandfathered players and allowed them to continue using it. After 1920, after Ray Chapman was hit in the head by a ball and later died from that injury, they started replacing the balls more often. The visiting team had the advantage of the clean ball! Before then, a single baseball was often used for the entire game.
The Only Limitation is Your Imagination
Retrosheet’s play-by-play data records nearly all activity that happens in a game. The awesome people that run the project ensure that the information is as accurate as possible. With the scope of the information and a few simple tools, one has the power to dive deep into the history and story of baseball. While you don’t have the information for every single game, you have the data on many. And you have all the play-by-play data since 1973, or as I call it, the Retrosheet Era.
Want to generate a daily tally of home runs hit on even-numbered days by left-handed batters hitting against right-handed pitchers with last names that started with S? You can. Want to create a run-expectancy matrix based on runners on base? You can.
Throughout the season, I’ll be showing how to use other available free data sources and offer some interesting information I have found in the data.
Featured image by Doug Carlin (@Bdougals on Twitter)