The intention behind this quick and fun exercise is to improve my SQL skills, and to dive into some baseball statistics.
To begin, I downloaded the SQL version of Sean Lahman’s iconic baseball database that spans the 1871-2023 seasons. You can find the database, and more about Sean Lahman here.
The SQL database file was in a .bak format, which I was unfamiliar
with initially. I learned that this was a common way to export a backup
of a database, and that in order to access and query it, I would need to
restore the database using SQL Server Management Studio. Once this was
completed, I added my local SQL server and the restored database as an
ODBC Data Source. Finally, I used the DBI
library in R, in
order to establish a connection to the database.
# Libraries
library("DBI")
library("tidyverse")
library("gt")
# Establishing ODBC connection to 'SQLserveR'
con <- dbConnect(odbc::odbc(), "SQLserveR")
I am interested in pitching statistics for my favorite team, the
Toronto Blue Jays. Using the query below, I have created a data frame
that contains the regular season pitching data for every pitcher, in
every year that the Jays have existed, minus this past season
(1977-2023.) Included in this dataframe are the pitcher’s first and last
names from the People
table of lahman2023, and a bevy of
pitching statistics from the Pitching
table.
SELECT Pitching.*, People.nameFirst, People.nameLast, People.throws
FROM lahman2023..Pitching
JOIN lahman2023..People
ON Pitching.playerID = People.playerID
WHERE Pitching.teamID = 'TOR'
ORDER BY Pitching.yearID
Using this dataframe, I wanted to look at two statistics in particular - Completed Games (CG) and Saves (SV).
# Creating a dataframe of CG and SV by yearID
p1 <- pitching_data %>% group_by(yearID) %>% summarize("Completed Games (CG)" = sum(CG), "Saves (SV)" = sum(SV))
# Pivoting dataframe
p1_long <- p1 %>% gather("Statistic", "Count", -yearID)
# Plotting CG and SV by yearID
ggplot(p1_long, aes(x = yearID, y = Count, fill = Statistic)) + geom_col(position = "dodge") + labs(title = "Pitching: Total Completed Games and Saves by Year\nToronto Blue Jays (1977-2023)", x = "Year")
Plotting these two basic statistics allows us to see a clear trend that has been a source of much discussion for baseball fans: the sharp decline in the number of complete games pitched. A “Completed Game” (CG) is fairly self explanatory, it is simply when a pitcher pitches the entirety of a game. A “Save” (SV) is a bit more complicated, as the pitcher must finish the game on the winning team and do one of the following: pitch at least three innings, enter the game with a lead of three or less runs while pitching at least one inning, or enter the game with the tying run in position (on-deck, at plate, or on the bases.)
Taking a look at our graph of completed games and saves per year, we can see the beginning of this trend towards using relief pitchers to finish games. The number of saves for the Toronto Blue Jays pitchers skyrockets in the 80’s and never lets up. In the modern era we begin to see years without a single completed game pitched, and when they happen they are newsworthy - typically because it means the pitcher has also pitched a shutout!
The Blue Jays won the World series in the year I was born, 1992, and the following year, in 1993. Given I missed watching these seasons and post-seasons, I figure it would be interesting to take a look at the best pitching performers of these postseasons.
SELECT PitchingPost.*, People.nameFirst, People.nameLast, People.throws
FROM lahman2023..PitchingPost
JOIN lahman2023..People
ON PitchingPost.playerID = People.playerID
WHERE PitchingPost.teamID = 'TOR' AND (PitchingPost.yearID = 1992 OR PitchingPost.yearID = 1993)
Creating a quick table with basic pitching statistics for starting World Series pitchers in 1992 and 1993 gives us a rough idea of who excelled and who struggled in these two series.
# Building a filtered postseason pitching dataset - starters
post_starters <- filter(postpitching_data, GS >= 1 & round == "WS") %>%
relocate(c("nameFirst","nameLast")) %>%
arrange(yearID, ERA) %>%
select(c("nameFirst","nameLast","yearID","round","G","GS","IPouts","H","ER","HR","BB","ERA"))
# Using gt to create a table
gt(post_starters) %>% tab_header(title = "Starting Pitcher Performance - World Series", subtitle = "Toronto Blue Jays (1992 + 1993)") %>%
opt_stylize(style = 6, color = "blue") %>%
tab_style(cell_fill(color = "#AECCE4"), locations = cells_body(rows = yearID > 1992)) %>%
tab_style(cell_fill(color = "#9ABDDC"), locations = cells_body(rows = yearID == 1992))
Starting Pitcher Performance - World Series | |||||||||||
Toronto Blue Jays (1992 + 1993) | |||||||||||
nameFirst | nameLast | yearID | round | G | GS | IPouts | H | ER | HR | BB | ERA |
---|---|---|---|---|---|---|---|---|---|---|---|
Jimmy | Key | 1992 | WS | 2 | 1 | 27 | 6 | 1 | 0 | 0 | 1.00 |
Juan | Guzman | 1992 | WS | 1 | 1 | 24 | 8 | 1 | 0 | 1 | 1.12 |
David | Cone | 1992 | WS | 2 | 2 | 31 | 9 | 4 | 0 | 8 | 3.48 |
Jack | Morris | 1992 | WS | 2 | 2 | 32 | 13 | 10 | 3 | 6 | 8.44 |
Pat | Hentgen | 1993 | WS | 1 | 1 | 18 | 5 | 1 | 0 | 3 | 1.50 |
Juan | Guzman | 1993 | WS | 2 | 2 | 36 | 10 | 5 | 0 | 8 | 3.75 |
Dave | Stewart | 1993 | WS | 2 | 2 | 36 | 10 | 9 | 2 | 8 | 6.75 |
Todd | Stottlemyre | 1993 | WS | 1 | 1 | 6 | 3 | 6 | 1 | 4 | 27.00 |
Right away, there are two starting pitchers who jump off the page but for opposite reasons. Todd Stottlemyre struggled mightily in his one start of the 1993 World Series, giving up six runs in two innings, for an ERA of 27.00. On the other side of things: Jimmy Key had a fantastic start in the 1992 World Series, and another portion of a game on top - with only six hits and one earned run through 27 IPouts, or outs pitched (9 innings.)
Diving a bit deeper, we can learn from the Baseball Reference Play by Play that he started game 4 of the 1992 World Series and pitched 7.2 innings, only giving up a single run. Game six went to extra innings and Key was called upon to enter with one out already on the board in the bottom of the tenth. He would go on to give up a single in the bottom of the 11th, and a couple groundouts before being pulled for Mike Timlin who secured the final out.
A fun piece of trivia - this game featured Deion Sanders playing LF (Left Field) for Atlanta. He later won two Super Bowl titles in professional American Football, making him the only athlete to play in both a World Series and a Super Bowl!
Taking a quick look at relieving pitchers, we can filter down to pitchers who recorded no starts, but one or more saves in either the ALCS or WS rounds.
# Building a filtered postseason pitching dataset - relievers
post_relievers <- filter(postpitching_data, GS < 1 & SV >= 1) %>%
relocate(c("nameFirst","nameLast")) %>%
arrange(yearID, ERA) %>%
select(c("nameFirst","nameLast","yearID","round","G","SV","IPouts","H","ER","HR","BB","ERA"))
# Using gt to create a table
gt(post_relievers) %>% tab_header(title = "Relieving Pitcher Performance - World Series", subtitle = "Toronto Blue Jays (1992 + 1993)") %>%
opt_stylize(style = 6, color = "blue") %>%
tab_style(cell_fill(color = "#AECCE4"), locations = cells_body(rows = yearID > 1992)) %>%
tab_style(cell_fill(color = "#9ABDDC"), locations = cells_body(rows = yearID == 1992))
Relieving Pitcher Performance - World Series | |||||||||||
Toronto Blue Jays (1992 + 1993) | |||||||||||
nameFirst | nameLast | yearID | round | G | SV | IPouts | H | ER | HR | BB | ERA |
---|---|---|---|---|---|---|---|---|---|---|---|
Tom | Henke | 1992 | ALCS | 4 | 3 | 14 | 3 | 0 | 0 | 2 | 0.00 |
Mike | Timlin | 1992 | WS | 2 | 1 | 4 | 0 | 0 | 0 | 0 | 0.00 |
Tom | Henke | 1992 | WS | 3 | 2 | 10 | 2 | 1 | 0 | 2 | 2.70 |
Duane | Ward | 1993 | WS | 4 | 2 | 14 | 3 | 1 | 1 | 0 | 1.93 |
Duane | Ward | 1993 | ALCS | 4 | 2 | 14 | 4 | 3 | 2 | 4 | 5.79 |
The heroes of the bullpen are immediately clear. Tom Henke in 1992 and Duane Ward in 1993 both met the threshold and put up five combined saves and four combined saves between the ALCS and World Series, respectively. Incidentally, according to Baseball Reference’s page: Toronto Blue Jays Top 10 Career Pitching Leaders, Tom Henke and Duane Ward are also the first and second overall ERA leaders in franchise history. In third place is another familiar face, Jimmy Key, who started his career as a reliever.
Now for the other side of things, hitting.
SELECT BattingPost.*, People.nameFirst, People.nameLast, People.throws
FROM lahman2023..BattingPost
JOIN lahman2023..People
ON BattingPost.playerID = People.playerID
WHERE BattingPost.teamID = 'TOR' AND (BattingPost.yearID = 1992 OR BattingPost.yearID = 1993)
Specifically, let’s take a look at Jays hitters in the ALCS and World Series who had greater than five hits, or greater than five runs batted in.
# Building a filtered postseason hitters dataset
post_hitters <- filter(posthitting_data, R > 5 | RBI > 5) %>%
relocate(c("nameFirst","nameLast")) %>%
arrange(yearID, nameFirst) %>%
select(c("nameFirst","nameLast","yearID","round","AB","R","H","HR","RBI"))
# Using gt to create a table
gt(post_hitters) %>% tab_header(title = "Hitter Performance: ALCS + World Series", subtitle = "Toronto Blue Jays (1992 + 1993)") %>%
opt_stylize(style = 6, color = "blue") %>%
tab_style(cell_fill(color = "#AECCE4"), locations = cells_body(rows = yearID > 1992)) %>%
tab_style(cell_fill(color = "#9ABDDC"), locations = cells_body(rows = yearID == 1992))
Hitter Performance: ALCS + World Series | ||||||||
Toronto Blue Jays (1992 + 1993) | ||||||||
nameFirst | nameLast | yearID | round | AB | R | H | HR | RBI |
---|---|---|---|---|---|---|---|---|
Candy | Maldonado | 1992 | ALCS | 22 | 3 | 6 | 2 | 6 |
Dave | Winfield | 1992 | ALCS | 24 | 7 | 6 | 2 | 3 |
Devon | White | 1993 | WS | 24 | 8 | 7 | 1 | 7 |
Joe | Carter | 1993 | WS | 25 | 6 | 7 | 2 | 8 |
Paul | Molitor | 1993 | ALCS | 23 | 7 | 9 | 1 | 5 |
Paul | Molitor | 1993 | WS | 24 | 10 | 12 | 2 | 8 |
Rickey | Henderson | 1993 | WS | 22 | 6 | 5 | 0 | 2 |
Roberto | Alomar | 1993 | WS | 25 | 5 | 12 | 0 | 6 |
Tony | Fernandez | 1993 | WS | 21 | 2 | 7 | 0 | 9 |
According to Wikipedia’s article on the 1992 World Series, this was notable for being “one of the few six-game series in which the winning team was outscored. This certainly meshes with our table, given only two hitters from 1992 appear, and neither from the World Series that year. The 1993 World Series however was a different story, with six hitters crossing our cutoff. Notably, Paul Molitor appeared both for his good ALCS performance, and his fantastic WS performance.
Another fun appearance is Tony Fernández’s 1993 World Series effort. He lead all post season hitters in both years in the ALCS or WS round in the RBI category, with nine. According to Baseball Reference’s rankings list Toronto Blue Jays Top 10 Career Batting Leaders, Fernández is both ranked 3rd in franchise Offensive WAR (Wins above Replacement) and ranked 1st in franchise Defensive WAR. Fernández played the shortstop position and was known for his defense, he won four consecutive Golden Glove awards, but he was also clearly an impressive hitter and was the first Jay to accomplish over 200 hits in a season.
This was ultimately a really enjoyable exercise. I feel as though I refreshed my experience using Git within RStudio, took a detour (depending on your perspective) to learn more about how SQL servers function, practiced some SQL queries, and learned a lot about my favorite baseball team’s World Series victories. To the last point, as someone who only started following baseball in 2021, I had a lot of fun laying a foundation of understanding about these older championship rosters.
There are two clear next steps for taking this exercise further. First, I would like to build a better understanding of advanced baseball statistics. After that, I would like to play around with modeling some baseball data in order to predict hitting performance.