This RMarkdown file contains the report of the data analysis done for the project on building and deploying a stroke prediction model in R. It covers data exploration, summary statistics, and the building and evaluation of the prediction models. The final report was completed on Sat Oct 25 17:26:40 2025.
Data Description:
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This data set is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(themis)
library(dplyr)
library(stringr)
library(purrr)
library(yardstick)
library(ranger)
library(vetiver)
##
## Attaching package: 'vetiver'
##
## The following object is masked from 'package:tune':
##
## load_pkgs
library(pins)
library(skimr)
healthcare_df <- read.csv("healthcare-dataset-stroke-data.csv")
head(healthcare_df)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
glimpse(healthcare_df)
## Rows: 5,110
## Columns: 12
## $ id <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
skim(healthcare_df)
Data summary

| | |
|---|---|
| Name | healthcare_df |
| Number of rows | 5110 |
| Number of columns | 12 |
| Column type frequency: | |
| character | 6 |
| numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| ever_married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| work_type | 0 | 1 | 7 | 13 | 0 | 5 | 0 |
| Residence_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| bmi | 0 | 1 | 2 | 4 | 0 | 419 | 0 |
| smoking_status | 0 | 1 | 6 | 15 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
# Count occurrences of the literal string "Unknown" in each column
count_unknown_total <- healthcare_df %>%
  summarise(across(everything(), ~ sum(. == "Unknown", na.rm = TRUE)))
print(count_unknown_total)
## id gender age hypertension heart_disease ever_married work_type
## 1 0 0 0 0 0 0 0
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 0 0 0 1544 0
# Count occurrences of the literal string "N/A" in each column
count_na_total <- healthcare_df %>%
  summarise(across(everything(), ~ sum(. == "N/A", na.rm = TRUE)))
print(count_na_total)
## id gender age hypertension heart_disease ever_married work_type
## 1 0 0 0 0 0 0 0
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 0 0 201 0 0
# True NA values per column (none: missingness is encoded as strings instead)
na_per_column <- colSums(is.na(healthcare_df))
print(na_per_column)
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
healthcare_df_clean <- healthcare_df %>%
  # Convert the "N/A" strings to real NAs and coerce bmi from chr to dbl
  mutate(bmi = as.numeric(case_when(bmi == "N/A" ~ NA_real_, TRUE ~ as.numeric(bmi)))) %>%
  # Turn the "Unknown" smoking status into a real NA
  mutate(smoking_status = case_when(smoking_status == "Unknown" ~ NA_character_, TRUE ~ smoking_status)) %>%
  # Impute missing bmi with 0 and keep unknown smoking status as an explicit category
  mutate(bmi = ifelse(is.na(bmi), 0, bmi)) %>%
  mutate(smoking_status = replace_na(smoking_status, "missing"))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `bmi = as.numeric(case_when(bmi == "N/A" ~ NA_real_, TRUE ~
## as.numeric(bmi)))`.
## Caused by warning:
## ! NAs introduced by coercion
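The coercion warning above is harmless here: as.numeric() is evaluated on every row, including the "N/A" strings, before case_when() replaces them. As a sketch, an equivalent but warning-free version of the same cleaning converts the sentinel strings to NA first with na_if() (the healthcare_df_quiet name is illustrative, not from the original analysis):

# Same cleaning, but "N/A"/"Unknown" become NA before any coercion
healthcare_df_quiet <- healthcare_df %>%
  mutate(
    bmi = as.numeric(na_if(bmi, "N/A")),               # no coercion warning
    bmi = ifelse(is.na(bmi), 0, bmi),                  # same 0-imputation as above
    smoking_status = na_if(smoking_status, "Unknown"),
    smoking_status = replace_na(smoking_status, "missing")
  )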
# Convert the outcome to a factor with "Yes" as the first (event) level
healthcare_df_clean <- healthcare_df_clean %>%
  mutate(stroke = factor(stroke, levels = c(1, 0), labels = c("Yes", "No")))
count_unknown_total_clean <- healthcare_df_clean %>%
summarise(across(everything(), ~ sum(. == "Unknown", na.rm = TRUE)))
count_na_total_clean <- healthcare_df_clean %>%
summarise(across(everything(), ~ sum(. == "N/A", na.rm = TRUE)))
na_per_column_clean <- colSums(is.na(healthcare_df_clean))
print(na_per_column_clean)
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
print(count_unknown_total_clean)
## id gender age hypertension heart_disease ever_married work_type
## 1 0 0 0 0 0 0 0
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 0 0 0 0 0
print(count_na_total_clean)
## id gender age hypertension heart_disease ever_married work_type
## 1 0 0 0 0 0 0 0
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 0 0 0 0 0
glimpse(healthcare_df_clean)
## Rows: 5,110
## Columns: 12
## $ id <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi <dbl> 36.6, 0.0, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 0.0, …
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
set.seed(42)
# 70/30 train/test split, stratified on the imbalanced outcome
data_split <- initial_split(healthcare_df_clean, prop = 0.70, strata = stroke)
train_data <- training(data_split)
test_data <- testing(data_split)
# 10-fold cross-validation on the training set, also stratified on stroke
data_folds <- vfold_cv(train_data, v = 10, strata = stroke)
print(data_folds)
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [3219/358]> Fold01
## 2 <split [3219/358]> Fold02
## 3 <split [3219/358]> Fold03
## 4 <split [3219/358]> Fold04
## 5 <split [3219/358]> Fold05
## 6 <split [3219/358]> Fold06
## 7 <split [3219/358]> Fold07
## 8 <split [3220/357]> Fold08
## 9 <split [3220/357]> Fold09
## 10 <split [3220/357]> Fold10
stroke_recipe <- recipe(stroke ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors(), -all_of(c("id"))) %>% # scale numeric predictors, excluding id
  step_novel(all_nominal_predictors()) %>%                       # handle factor levels unseen at training time
  step_dummy(all_nominal_predictors()) %>%                       # one-hot encode the categorical predictors
  step_smote(stroke, over_ratio = 1)                             # oversample the minority class to a 1:1 ratio
print(stroke_recipe)
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 11
##
## ── Operations
## • Centering and scaling for: all_numeric_predictors() -all_of(c("id"))
## • Novel factor level assignment for: all_nominal_predictors()
## • Dummy variables from: all_nominal_predictors()
## • SMOTE based on: stroke
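To confirm what step_smote() does to the class balance, the recipe can be prepped and baked on its own training data. A minimal sketch; with over_ratio = 1 the two classes should come out equal in the processed training set:

# Apply the recipe to the training data and count the outcome classes;
# bake(new_data = NULL) returns the prepped training set, SMOTE rows included
stroke_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(stroke)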
log_reg_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
stroke_flow <- workflow() %>%
add_recipe(stroke_recipe) %>%
add_model(log_reg_spec)
print(stroke_flow)
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_normalize()
## • step_novel()
## • step_dummy()
## • step_smote()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
log_reg_fit_cv <- stroke_flow %>%
fit_resamples(resamples = data_folds, metrics = metric_set(roc_auc, sensitivity, specificity, accuracy))
## Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
## 1.2.0.
## ℹ See details at
## <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
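The tidyselect deprecation warning comes from the -all_of(c("id")) exclusion inside step_normalize(). As a sketch, the same recipe with the currently supported selector syntax (the stroke_recipe_nowarn name is illustrative; the behavior is unchanged):

# Identical recipe; id is excluded by name instead of via -all_of()
stroke_recipe_nowarn <- recipe(stroke ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors(), -id) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_smote(stroke, over_ratio = 1)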
collect_metrics(log_reg_fit_cv)
## # A tibble: 4 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.762 10 0.00742 pre0_mod0_post0
## 2 roc_auc binary 0.840 10 0.00952 pre0_mod0_post0
## 3 sensitivity binary 0.761 10 0.0252 pre0_mod0_post0
## 4 specificity binary 0.763 10 0.00803 pre0_mod0_post0
final_model_fit <- stroke_flow %>%
fit(data = train_data)
test_predictions <- final_model_fit %>%
predict(test_data) %>%
bind_cols(test_data %>% select(stroke))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
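The rank-deficiency warning means some dummy columns are linearly redundant, typically the "new" levels added by step_novel() that never occur in the training data, so glm cannot estimate a coefficient for them. A quick way to see the affected terms, as a sketch:

# Coefficients glm could not estimate (aliased terms) show up as NA
glm_fit <- extract_fit_engine(final_model_fit)
coef(glm_fit)[is.na(coef(glm_fit))]

Adding step_zv(all_predictors()) after step_dummy() in the recipe would drop such zero-variance columns and silence the warning.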
conf_mat(test_predictions, truth = stroke, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 64 394
## No 13 1062
test_predictions %>% metrics(truth = stroke, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.735
## 2 kap binary 0.168
eval_metrics <- test_predictions %>%
  metrics(truth = stroke, estimate = .pred_class) %>%
  bind_rows(
    test_predictions %>% bal_accuracy(truth = stroke, estimate = .pred_class),
    test_predictions %>% f_meas(truth = stroke, estimate = .pred_class)
  )
print("--- Logistic Regression Metrics ---")
## [1] "--- Logistic Regression Metrics ---"
print(eval_metrics)
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.735
## 2 kap binary 0.168
## 3 bal_accuracy binary 0.780
## 4 f_meas binary 0.239
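Since predict() can also return class probabilities, threshold-free metrics can be computed on the test set as well. A minimal sketch (the prob_preds name is ours, not from the original analysis):

# Probability predictions for the test set
prob_preds <- final_model_fit %>%
  predict(test_data, type = "prob") %>%
  bind_cols(test_data %>% select(stroke))
# ROC AUC; "Yes" is the first factor level, so it is treated as the event
prob_preds %>% roc_auc(truth = stroke, .pred_Yes)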
rf_spec <- rand_forest(trees = 1000) %>%
set_engine("ranger", num.threads = 4) %>%
set_mode("classification")
rf_wflow <- workflow() %>%
add_recipe(stroke_recipe) %>%
add_model(rf_spec)
message("Starting fit_resamples for Random Forest...")
## Starting fit_resamples for Random Forest...
rf_fit_cv <- rf_wflow %>%
fit_resamples(resamples = data_folds, metrics = metric_set(roc_auc, sensitivity, specificity, accuracy))
print("--- Random Forest (CV) Metrics ---")
## [1] "--- Random Forest (CV) Metrics ---"
collect_metrics(rf_fit_cv)
## # A tibble: 4 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.947 10 0.00320 pre0_mod0_post0
## 2 roc_auc binary 0.808 10 0.0100 pre0_mod0_post0
## 3 sensitivity binary 0.0189 10 0.00977 pre0_mod0_post0
## 4 specificity binary 0.994 10 0.00160 pre0_mod0_post0
final_rf_fit <- rf_wflow %>%
fit(data = train_data)
print(final_rf_fit)
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_normalize()
## • step_novel()
## • step_dummy()
## • step_smote()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = ~4, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 6810
## Number of independent variables: 21
## Mtry: 4
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.02811774
# Bundle the trained workflow and its metadata for deployment
v <- final_rf_fit %>% vetiver_model(model_name = "stroke_prediction_rf")
## Registered S3 method overwritten by 'butcher':
## method from
## as.character.dev_topic generics
print(v)
##
## ── stroke_prediction_rf ─ <bundled_workflow> model for deployment
## A ranger classification modeling workflow using 11 features
# Pin the versioned model to a local folder board
board_path <- "C:/Users/ktss/Desktop/stroke_api/model_board"
board <- board_folder(path = board_path)
board %>% vetiver_pin_write(v)
## Replacing version '20251025T232430Z-a4f60' with '20251025T232803Z-c1f9a'
## Writing to pin 'stroke_prediction_rf'
##
## Create a Model Card for your published model
## • Model Cards provide a framework for transparent, responsible reporting
## • Use the vetiver `.Rmd` template as a place to start
vetiver_write_plumber(board, "stroke_prediction_rf")
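The generated plumber.R file can be deployed to a server, and the same API can also be started directly from the vetiver model object for local testing. A minimal sketch, assuming port 8000 is free:

library(plumber)
# Serve the model locally; POST /predict accepts JSON rows with the input columns
pr() %>%
  vetiver_api(v) %>%
  pr_run(port = 8000)
# From another R session, the running endpoint can be queried with, e.g.:
# endpoint <- vetiver_endpoint("http://127.0.0.1:8000/predict")
# predict(endpoint, new_data)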
Conclusion:
This section summarizes the process, the evaluation of the stroke prediction models, and the key learnings from the project.
The project involved a comprehensive machine learning workflow, starting with extensive data preparation:

- The bmi column, which contained "N/A" strings, required transformation from chr to dbl, with imputation replacing the "N/A" (missing) values with 0.
- The smoking_status column was handled by converting "Unknown" to an explicit category, "missing", to retain that information.
- Preprocessing applied step_normalize() to the numerical features and step_dummy() to the categorical features.
- The outcome stroke was heavily imbalanced (very few positive cases). To address this critical issue, the step_smote() function from the themis package was incorporated into the recipe, generating synthetic samples of the minority class to create a more balanced training dataset.

Two distinct models were compared using 10-fold cross-validation (fit_resamples): Logistic Regression and Random Forest.
| Metric | Logistic Regression (CV) | Random Forest (CV) |
|---|---|---|
| Accuracy | 0.762 | 0.947 |
| ROC AUC | 0.840 | 0.808 |
| Sensitivity | 0.761 | 0.0189 |
| Specificity | 0.763 | 0.994 |
- The Random Forest achieved the highest Accuracy (0.947) and near-perfect Specificity (0.994), indicating it is excellent at correctly identifying patients who do not have a stroke (true negatives). However, its Sensitivity was extremely low (0.0189). In a clinical setting, this means the model fails to identify almost all actual stroke cases (a high rate of false negatives). This behavior is likely due to the model overfitting to the majority class ("No Stroke") or to the SMOTE technique being ineffective for this specific algorithm/dataset combination.
- The Logistic Regression achieved balanced Sensitivity (0.761) and Specificity (0.763). Although its overall Accuracy is lower, its balanced treatment of both positive and negative cases is clinically more responsible.

Based on this evaluation, the Logistic Regression model is the more reliable choice for an initial production environment, as its ability to correctly identify positive stroke cases (Sensitivity) is significantly better than that of the Random Forest. The Random Forest's high Accuracy is misleading given its near-zero Sensitivity.
Future work should focus on tuning the models, or adjusting the classification threshold, to improve Sensitivity without completely sacrificing Specificity; a sketch of such a threshold scan follows.
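This sketch assumes the probably package (not loaded elsewhere in this report) is installed; its threshold_perf() recomputes sensitivity and specificity across a grid of cutoffs on the held-out predictions (the lr_prob_preds name is illustrative):

library(probably)
# Class probabilities for the test set from the logistic regression workflow
lr_prob_preds <- final_model_fit %>%
  predict(test_data, type = "prob") %>%
  bind_cols(test_data %>% select(stroke))
# Scan cutoffs from 0.1 to 0.9; "Yes" is the first factor level, i.e. the event
lr_prob_preds %>%
  threshold_perf(truth = stroke, estimate = .pred_Yes,
                 thresholds = seq(0.1, 0.9, by = 0.05)) %>%
  filter(.metric %in% c("sensitivity", "specificity"))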
Finally, the trained model was published with the vetiver package and exposed as a Plumber REST API. This implementation allows for real-time predictions and integration into external applications, completing the end-to-end machine learning lifecycle.