Introduction

 

The intention behind this quick and fun exercise is to improve my SQL skills, and to dive into some baseball statistics.

To begin, I downloaded the SQL version of Sean Lahman’s iconic baseball database that spans the 1871-2023 seasons. You can find the database, and more about Sean Lahman here.

 

Accessing the SQL database

 

The SQL database file was in a .bak format, which I was unfamiliar with initially. I learned that this was a common way to export a backup of a database, and that in order to access and query it, I would need to restore the database using SQL Server Management Studio. Once this was completed, I added my local SQL server and the restored database as an ODBC Data Source. Finally, I used the DBI library in R, in order to establish a connection to the database.

# Libraries
library("DBI")
library("tidyverse")
library("gt")
# Establishing ODBC connection to 'SQLserveR'
con <- dbConnect(odbc::odbc(), "SQLserveR")

 

Toronto Blue Jays - Historic Regular Season Pitching Data

 

I am interested in pitching statistics for my favorite team, the Toronto Blue Jays. Using the query below, I have created a data frame that contains the regular season pitching data for every pitcher, in every year that the Jays have existed, minus this past season (1977-2023.) Included in this dataframe are the pitcher’s first and last names from the People table of lahman2023, and a bevy of pitching statistics from the Pitching table.

SELECT Pitching.*, People.nameFirst, People.nameLast, People.throws
FROM lahman2023..Pitching
JOIN lahman2023..People
ON Pitching.playerID = People.playerID
WHERE Pitching.teamID = 'TOR'
ORDER BY Pitching.yearID

 

Using this dataframe, I wanted to look at two statistics in particular - Completed Games (CG) and Saves (SV).

# Creating a dataframe of CG and SV by yearID
p1 <- pitching_data %>% group_by(yearID) %>% summarize("Completed Games (CG)" = sum(CG), "Saves (SV)" = sum(SV))

# Pivoting dataframe
p1_long <- p1 %>% gather("Statistic", "Count", -yearID)

# Plotting CG and SV by yearID
ggplot(p1_long, aes(x = yearID, y = Count, fill = Statistic)) + geom_col(position = "dodge") + labs(title = "Pitching: Total Completed Games and Saves by Year\nToronto Blue Jays (1977-2023)", x = "Year")

Plotting these two basic statistics allows us to see a clear trend that has been a source of much discussion for baseball fans: the sharp decline in the number of complete games pitched. A “Completed Game” (CG) is fairly self explanatory, it is simply when a pitcher pitches the entirety of a game. A “Save” (SV) is a bit more complicated, as the pitcher must finish the game on the winning team and do one of the following: pitch at least three innings, enter the game with a lead of three or less runs while pitching at least one inning, or enter the game with the tying run in position (on-deck, at plate, or on the bases.)

Taking a look at our graph of completed games and saves per year, we can see the beginning of this trend towards using relief pitchers to finish games. The number of saves for the Toronto Blue Jays pitchers skyrockets in the 80’s and never lets up. In the modern era we begin to see years without a single completed game pitched, and when they happen they are newsworthy - typically because it means the pitcher has also pitched a shutout!

 

Toronto Blue Jays - 1992 and 1993 Postseason Pitching

 

The Blue Jays won the World series in the year I was born, 1992, and the following year, in 1993. Given I missed watching these seasons and post-seasons, I figure it would be interesting to take a look at the best pitching performers of these postseasons.

SELECT PitchingPost.*, People.nameFirst, People.nameLast, People.throws
FROM lahman2023..PitchingPost
JOIN lahman2023..People
ON PitchingPost.playerID = People.playerID
WHERE PitchingPost.teamID = 'TOR' AND (PitchingPost.yearID = 1992 OR PitchingPost.yearID = 1993)

 

Creating a quick table with basic pitching statistics for starting World Series pitchers in 1992 and 1993 gives us a rough idea of who excelled and who struggled in these two series.

# Building a filtered postseason pitching dataset - starters
post_starters <- filter(postpitching_data, GS >= 1 & round == "WS") %>% 
  relocate(c("nameFirst","nameLast")) %>%
  arrange(yearID, ERA) %>% 
  select(c("nameFirst","nameLast","yearID","round","G","GS","IPouts","H","ER","HR","BB","ERA"))

# Using gt to create a table
gt(post_starters) %>% tab_header(title = "Starting Pitcher Performance - World Series", subtitle = "Toronto Blue Jays (1992 + 1993)") %>% 
  opt_stylize(style = 6, color = "blue") %>%
  tab_style(cell_fill(color = "#AECCE4"), locations = cells_body(rows = yearID > 1992)) %>%
  tab_style(cell_fill(color = "#9ABDDC"), locations = cells_body(rows = yearID == 1992))
Starting Pitcher Performance - World Series
Toronto Blue Jays (1992 + 1993)
nameFirst nameLast yearID round G GS IPouts H ER HR BB ERA
Jimmy Key 1992 WS 2 1 27 6 1 0 0 1.00
Juan Guzman 1992 WS 1 1 24 8 1 0 1 1.12
David Cone 1992 WS 2 2 31 9 4 0 8 3.48
Jack Morris 1992 WS 2 2 32 13 10 3 6 8.44
Pat Hentgen 1993 WS 1 1 18 5 1 0 3 1.50
Juan Guzman 1993 WS 2 2 36 10 5 0 8 3.75
Dave Stewart 1993 WS 2 2 36 10 9 2 8 6.75
Todd Stottlemyre 1993 WS 1 1 6 3 6 1 4 27.00

 

Right away, there are two starting pitchers who jump off the page but for opposite reasons. Todd Stottlemyre struggled mightily in his one start of the 1993 World Series, giving up six runs in two innings, for an ERA of 27.00. On the other side of things: Jimmy Key had a fantastic start in the 1992 World Series, and another portion of a game on top - with only six hits and one earned run through 27 IPouts, or outs pitched (9 innings.)

Diving a bit deeper, we can learn from the Baseball Reference Play by Play that he started game 4 of the 1992 World Series and pitched 7.2 innings, only giving up a single run. Game six went to extra innings and Key was called upon to enter with one out already on the board in the bottom of the tenth. He would go on to give up a single in the bottom of the 11th, and a couple groundouts before being pulled for Mike Timlin who secured the final out.

A fun piece of trivia - this game featured Deion Sanders playing LF (Left Field) for Atlanta. He later won two Super Bowl titles in professional American Football, making him the only athlete to play in both a World Series and a Super Bowl!

Taking a quick look at relieving pitchers, we can filter down to pitchers who recorded no starts, but one or more saves in either the ALCS or WS rounds.

# Building a filtered postseason pitching dataset - relievers
post_relievers <- filter(postpitching_data, GS < 1 & SV >= 1) %>% 
  relocate(c("nameFirst","nameLast")) %>%
  arrange(yearID, ERA) %>% 
  select(c("nameFirst","nameLast","yearID","round","G","SV","IPouts","H","ER","HR","BB","ERA"))

# Using gt to create a table
gt(post_relievers) %>% tab_header(title = "Relieving Pitcher Performance - World Series", subtitle = "Toronto Blue Jays (1992 + 1993)") %>% 
  opt_stylize(style = 6, color = "blue") %>%
  tab_style(cell_fill(color = "#AECCE4"), locations = cells_body(rows = yearID > 1992)) %>%
  tab_style(cell_fill(color = "#9ABDDC"), locations = cells_body(rows = yearID == 1992))
Relieving Pitcher Performance - World Series
Toronto Blue Jays (1992 + 1993)
nameFirst nameLast yearID round G SV IPouts H ER HR BB ERA
Tom Henke 1992 ALCS 4 3 14 3 0 0 2 0.00
Mike Timlin 1992 WS 2 1 4 0 0 0 0 0.00
Tom Henke 1992 WS 3 2 10 2 1 0 2 2.70
Duane Ward 1993 WS 4 2 14 3 1 1 0 1.93
Duane Ward 1993 ALCS 4 2 14 4 3 2 4 5.79

 

The heroes of the bullpen are immediately clear. Tom Henke in 1992 and Duane Ward in 1993 both met the threshold and put up five combined saves and four combined saves between the ALCS and World Series, respectively. Incidentally, according to Baseball Reference’s page: Toronto Blue Jays Top 10 Career Pitching Leaders, Tom Henke and Duane Ward are also the first and second overall ERA leaders in franchise history. In third place is another familiar face, Jimmy Key, who started his career as a reliever.

 

Toronto Blue Jays - 1992 and 1993 Postseason Hitting

 

Now for the other side of things, hitting.

SELECT BattingPost.*, People.nameFirst, People.nameLast, People.throws
FROM lahman2023..BattingPost
JOIN lahman2023..People
ON BattingPost.playerID = People.playerID
WHERE BattingPost.teamID = 'TOR' AND (BattingPost.yearID = 1992 OR BattingPost.yearID = 1993)

Specifically, let’s take a look at Jays hitters in the ALCS and World Series who had greater than five hits, or greater than five runs batted in.

# Building a filtered postseason hitters dataset
post_hitters <- filter(posthitting_data, R > 5 | RBI > 5) %>% 
  relocate(c("nameFirst","nameLast")) %>%
  arrange(yearID, nameFirst) %>% 
  select(c("nameFirst","nameLast","yearID","round","AB","R","H","HR","RBI"))

# Using gt to create a table
gt(post_hitters) %>% tab_header(title = "Hitter Performance: ALCS + World Series", subtitle = "Toronto Blue Jays (1992 + 1993)") %>% 
  opt_stylize(style = 6, color = "blue") %>%
  tab_style(cell_fill(color = "#AECCE4"), locations = cells_body(rows = yearID > 1992)) %>%
  tab_style(cell_fill(color = "#9ABDDC"), locations = cells_body(rows = yearID == 1992))
Hitter Performance: ALCS + World Series
Toronto Blue Jays (1992 + 1993)
nameFirst nameLast yearID round AB R H HR RBI
Candy Maldonado 1992 ALCS 22 3 6 2 6
Dave Winfield 1992 ALCS 24 7 6 2 3
Devon White 1993 WS 24 8 7 1 7
Joe Carter 1993 WS 25 6 7 2 8
Paul Molitor 1993 ALCS 23 7 9 1 5
Paul Molitor 1993 WS 24 10 12 2 8
Rickey Henderson 1993 WS 22 6 5 0 2
Roberto Alomar 1993 WS 25 5 12 0 6
Tony Fernandez 1993 WS 21 2 7 0 9

 

According to Wikipedia’s article on the 1992 World Series, this was notable for being “one of the few six-game series in which the winning team was outscored. This certainly meshes with our table, given only two hitters from 1992 appear, and neither from the World Series that year. The 1993 World Series however was a different story, with six hitters crossing our cutoff. Notably, Paul Molitor appeared both for his good ALCS performance, and his fantastic WS performance.

 

Tony Fernández (Image Credit: G. Paul Burnett, Associated Press file)
Tony Fernández (Image Credit: G. Paul Burnett, Associated Press file)

 

Another fun appearance is Tony Fernández’s 1993 World Series effort. He lead all post season hitters in both years in the ALCS or WS round in the RBI category, with nine. According to Baseball Reference’s rankings list Toronto Blue Jays Top 10 Career Batting Leaders, Fernández is both ranked 3rd in franchise Offensive WAR (Wins above Replacement) and ranked 1st in franchise Defensive WAR. Fernández played the shortstop position and was known for his defense, he won four consecutive Golden Glove awards, but he was also clearly an impressive hitter and was the first Jay to accomplish over 200 hits in a season.

 

Conclusion

 

This was ultimately a really enjoyable exercise. I feel as though I refreshed my experience using Git within RStudio, took a detour (depending on your perspective) to learn more about how SQL servers function, practiced some SQL queries, and learned a lot about my favorite baseball team’s World Series victories. To the last point, as someone who only started following baseball in 2021, I had a lot of fun laying a foundation of understanding about these older championship rosters.

There are two clear next steps for taking this exercise further. First, I would like to build a better understanding of advanced baseball statistics. After that, I would like to play around with modeling some baseball data in order to predict hitting performance.