Pitcher List Data Camp: Personal Data Warehouse

In this golden age of baseball analytics, the data resources available to us for our analyses are anything but lacking. But what if we wanted to focus on a more obscure test that would require raw, unfiltered data?

Our focus today will be on generating that data through a simple Python script and saving it to our machine. We want quick access, so we’ll limit each script execution to a single player and year. The resulting file will contain the Statcast values of every pitch thrown by a pitcher, or faced by a batter, in a season. Now that the plan is set, let’s jump in.

Importer setup

Before we start, make sure the requests package is installed (see https://docs.python-requests.org/en/latest/user/install/#install for instructions).

The first thing we’ll want to do is set the parameters for our importer. Since we’ll be fetching a batter or pitcher’s yearly data, we’ll need a player name and a year.

import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Statcast data importer.")
    parser.add_argument(
        "--year",
        nargs=1,
        type=int,
        required=True,
        help="Year to import."
    )
    parser.add_argument(
        "--batter",
        nargs=1,
        type=str,
        required=False,
        help="Batter to import."
    )
    parser.add_argument(
        "--pitcher",
        nargs=1,
        type=str,
        required=False,
        help="Pitcher to import."
    )
    args = parser.parse_args()

    # nargs=1 stores each value in a one-item list, hence the [0].
    year = args.year[0]
    if args.batter:
        pass  # We'll fill this in later
    elif args.pitcher:
        pass  # We'll fill this in later
    else:
        print("Must provide a batter or pitcher.")
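To see why the script reads args.year[0] rather than args.year, here is a minimal sketch that feeds the same three options a hard-coded argument list instead of the real command line. The build_parser helper is a hypothetical name used only for this illustration.

```python
import argparse


def build_parser():
    # Same three options as the importer, without the help texts.
    parser = argparse.ArgumentParser(description="Statcast data importer.")
    parser.add_argument("--year", nargs=1, type=int, required=True)
    parser.add_argument("--batter", nargs=1, type=str, required=False)
    parser.add_argument("--pitcher", nargs=1, type=str, required=False)
    return parser


# Passing a list to parse_args stands in for the real command line.
args = build_parser().parse_args(["--year", "2021", "--batter", "mookie betts"])

print(args.year)     # a one-item list, because of nargs=1
print(args.year[0])  # the year itself, as an int
print(args.batter)   # a one-item list holding the quoted name
print(args.pitcher)  # None, since --pitcher was not provided
```

Because an option that was not provided is None, the if args.batter / elif args.pitcher checks work as simple truthiness tests.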

Next, we’ll define a function that asks us to pick from a list of players whose names match the one we provided. We’re doing it this way because BaseballSavant relies on a player’s unique numeric ID (assigned by Major League Baseball) to generate the data. Just as with website domain names and their associated IP addresses, numbers are much harder to keep track of than words, in this case the player’s name.

It’s important to note that using common names will result in a bigger list of matches, making the player selection longer if the one we’re looking for is at the bottom of that list. Instead of using the name Jones, opt for something more precise like Adam Jones.

def search_player(pattern):
    url = "https://baseballsavant.mlb.com/player/search-all"
    # Pass the name via params so requests URL-encodes it for us.
    with requests.get(url, params={"search": pattern}) as r:
        r.raise_for_status()
        matches = r.json()

    if not matches:
        print("No matches found.")
        return False

    for match in matches:
        if not isinstance(match, dict):
            continue

        player_name = match.get("name", False)
        if not player_name:
            continue

        print("=============================================")
        print("Match found: {name}".format(name=player_name))

        user_input = input(
            "Type yes if this is the player you want"
            " or press enter to go to the next match."
        )
        if user_input.strip().lower() == "yes":
            return match

    return False
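If stepping through the matches one by one gets tedious, one possible shortcut is to look for an exact, case-insensitive name match first. The sketch below assumes, as search_player does, that the search returns a list of dictionaries with a "name" key. The pick_exact_match helper and the sample entries are hypothetical, and the ids are placeholders rather than real MLB IDs.

```python
def pick_exact_match(matches, full_name):
    """Return the first match whose name equals full_name, ignoring case."""
    wanted = full_name.strip().lower()
    for match in matches:
        if not isinstance(match, dict):
            continue
        name = match.get("name") or ""
        if name.strip().lower() == wanted:
            return match
    return None


# Made-up sample shaped like the search results; the ids are placeholders.
sample_matches = [
    {"name": "Andruw Jones", "id": 1},
    {"name": "Adam Jones", "id": 2},
]

print(pick_exact_match(sample_matches, "adam jones"))
print(pick_exact_match(sample_matches, "jones"))  # no exact match
```

When nothing matches exactly, the function returns None, so the interactive prompt can still serve as a fallback.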

And now for the most important step: data retrieval. Other than the hfSea, player_id and player_type parameters, which the script fills in for us, we can set our own constraints directly in the code. For example, to keep the data only when the pitcher threw at least 100 pitches, set min_pitches to 100 instead of 0. To get regular season data only, set hfGT to R| (NOTE: the value associated with hfGT must end with |).

def import_player_data(year, player, player_type):
    if not isinstance(player, dict):
        print("Importer player parameter is invalid.")
        return

    player_id = player.get("id", False)
    if not player_id:
        print("Player has no unique id.")
        return

    url = "https://baseballsavant.mlb.com/feed"
    parameters = {
        "warehouse": True,
        "hfGT": "R|PO|",
        "min_pitches": 0,
        "min_results": 0,
        "min_pas": 0,
        "type": "details",
        "player_type": player_type,
        "player_id": player_id,
        "hfSea": "{y}|".format(y=year)
    }
    with requests.get(url, params=parameters) as r:
        r.raise_for_status()
        response = r.json()

    return response
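To make the constraints described above easier to tweak, the parameters could be built by a small helper. This is only a sketch: build_parameters is a hypothetical name, 12345 is a placeholder rather than a real player ID, and urlencode from the standard library is used here just to preview the query string without sending a request.

```python
from urllib.parse import urlencode


def build_parameters(year, player_id, player_type,
                     game_types="R|PO|", min_pitches=0):
    # The article notes that the hfGT value must end with "|".
    if not game_types.endswith("|"):
        raise ValueError("game_types must end with '|'")

    return {
        "warehouse": True,
        "hfGT": game_types,
        "min_pitches": min_pitches,
        "min_results": 0,
        "min_pas": 0,
        "type": "details",
        "player_type": player_type,
        "player_id": player_id,
        "hfSea": "{y}|".format(y=year),
    }


# Regular season only, at least 100 pitches; 12345 is a placeholder id.
params = build_parameters(2019, 12345, "pitcher",
                          game_types="R|", min_pitches=100)
query = urlencode(params)
print(query)
```

In the preview, each | shows up percent-encoded as %7C, which is the ordinary URL encoding of that character.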

Once that data is received, we’ll want to save it to a CSV file. The file we’ll write to will be named after the player and saved in the directory we run the script from (the current working directory). However, if we want to save it elsewhere or use a different name, simply modify the player_name, filename and/or filepath variables.

def save_to_file(player, player_data):
    if not player_data:
        print("No player data to save to file.")
        return

    player_name = player["name"].replace(".", "").strip().lower()
    filename = "_".join(player_name.split(" ")) + ".csv"
    filepath = os.path.join(os.getcwd(), filename)
    # newline="" lets the csv module control line endings (avoids blank rows on Windows).
    with open(filepath, "w", newline="") as csv_file:
        rows = []
        writer = csv.writer(
            csv_file,
            delimiter=",",
            quotechar="\""
        )

        header = player_data[0].keys()
        rows.append(header)

        for data in player_data:
            row = [data[key] for key in header]
            rows.append(row)

        writer.writerows(rows)
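As a quick check of how the file name is derived, here are the same string operations pulled out into a standalone function. The build_filename name is hypothetical and used only for this illustration.

```python
def build_filename(name):
    # Drop periods, trim whitespace, lowercase, then join words with underscores.
    player_name = name.replace(".", "").strip().lower()
    return "_".join(player_name.split(" ")) + ".csv"


print(build_filename("Max Scherzer"))
print(build_filename("J.D. Martinez"))
```

The two calls produce max_scherzer.csv and jd_martinez.csv respectively, so initials collapse into a single word.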

Finally, we’ll combine all of those functions into one file. For the sake of not overpopulating this page with code, I’ll only use the names of the functions surrounded by square brackets. Make sure to replace them with the actual function code before running the script.

import os
import csv
import argparse
import requests


[import_player_data]

[search_player]

[save_to_file]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Statcast data importer.")
    parser.add_argument(
        "--year",
        nargs=1,
        type=int,
        required=True,
        help="Year to import."
    )
    parser.add_argument(
        "--batter",
        nargs=1,
        type=str,
        required=False,
        help="Batter to import."
    )
    parser.add_argument(
        "--pitcher",
        nargs=1,
        type=str,
        required=False,
        help="Pitcher to import."
    )
    args = parser.parse_args()

    year = args.year[0]
    if args.batter:
        player = search_player(args.batter[0])
        if player:
            player_data = import_player_data(year, player, "batter")
            save_to_file(player, player_data)

    elif args.pitcher:
        player = search_player(args.pitcher[0])
        if player:
            player_data = import_player_data(year, player, "pitcher")
            save_to_file(player, player_data)

    else:
        print("Must provide a batter or pitcher.")

Save that file and voilà, we’re all set!

Importer usage

Here are a few examples of how to run the importer (make sure to replace [file] with the name of the Python file):

Mookie Betts batting data for 2021: python3 [file] --year 2021 --batter "mookie betts"
Max Scherzer pitching data for 2019: python3 [file] --year 2019 --pitcher "max scherzer"
Madison Bumgarner batting data for 2018: python3 [file] --year 2018 --batter "madison bumgarner"

The data we’re pulling is from BaseballSavant, so always remember to credit them accordingly. For the full documentation of what each column in the CSV file represents, refer to https://baseballsavant.mlb.com/csv-docs.
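Once a file is saved, a first analysis can be as simple as tallying the values in one column. The sketch below is a minimal example under a few assumptions: count_values is a hypothetical helper, the sample rows are made up, and the pitch_type and release_speed column names are assumed to appear in the saved file as they do in the Statcast documentation.

```python
import csv
import io
from collections import Counter


def count_values(csv_file, column):
    """Tally the non-empty values found in one column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader if row.get(column))


# Made-up rows standing in for a saved file; column names are assumptions.
sample = io.StringIO(
    "pitch_type,release_speed\n"
    "FF,94.1\n"
    "SL,85.3\n"
    "FF,95.0\n"
)

counts = count_values(sample, "pitch_type")
print(counts.most_common())

# With a real file, open it the same way the importer wrote it:
# with open("max_scherzer.csv", newline="") as csv_file:
#     print(count_values(csv_file, "pitch_type").most_common())
```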

Happy hacking to all you number junkies out there!

Featured Image by Justin Paradis (@JustParaDesigns on Twitter)
