About Data Analysis Report

This RMarkdown file contains the report of the data analysis carried out for the project on building and deploying a stroke prediction model in R. It covers data exploration, summary statistics, and the building and evaluation of the prediction models. The final report was completed on Sat Oct 25 17:26:40 2025.

Data Description:

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.

Task One: Import data and data preprocessing

Load packages and data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.10     ✔ rsample      1.3.1 
## ✔ dials        1.4.2      ✔ tailor       0.1.0 
## ✔ infer        1.0.9      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.3.3      ✔ workflowsets 1.1.1 
## ✔ recipes      1.3.1      ✔ yardstick    1.3.2 
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(themis)

library(dplyr)
library(stringr)
library(purrr)
library(yardstick)
library(ranger)
library(vetiver)
## 
## Attaching package: 'vetiver'
## 
## The following object is masked from 'package:tune':
## 
##     load_pkgs
library(pins)

library(skimr)

healthcare_df <- read.csv("healthcare-dataset-stroke-data.csv")

head(healthcare_df)
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1

Describe and explore the data

glimpse(healthcare_df)
## Rows: 5,110
## Columns: 12
## $ id                <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension      <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease     <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
skim(healthcare_df)
Data summary
Name healthcare_df
Number of rows 5110
Number of columns 12
_______________________
Column type frequency:
character 6
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 3 0
ever_married 0 1 2 3 0 2 0
work_type 0 1 7 13 0 5 0
Residence_type 0 1 5 5 0 2 0
bmi 0 1 2 4 0 419 0
smoking_status 0 1 6 15 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 36517.83 21161.72 67.00 17741.25 36932.00 54682.00 72940.00 ▇▇▇▇▇
age 0 1 43.23 22.61 0.08 25.00 45.00 61.00 82.00 ▅▆▇▇▆
hypertension 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
heart_disease 0 1 0.05 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
avg_glucose_level 0 1 106.15 45.28 55.12 77.24 91.88 114.09 271.74 ▇▃▁▁▁
stroke 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
count_unknown_total <- healthcare_df %>%
  summarise(across(everything(), ~ sum(. == "Unknown", na.rm = TRUE)))

print(count_unknown_total)
##   id gender age hypertension heart_disease ever_married work_type
## 1  0      0   0            0             0            0         0
##   Residence_type avg_glucose_level bmi smoking_status stroke
## 1              0                 0   0           1544      0
count_na_total <- healthcare_df %>%
  summarise(across(everything(), ~ sum(. == "N/A", na.rm = TRUE)))

print(count_na_total)
##   id gender age hypertension heart_disease ever_married work_type
## 1  0      0   0            0             0            0         0
##   Residence_type avg_glucose_level bmi smoking_status stroke
## 1              0                 0 201              0      0
na_per_column <- colSums(is.na(healthcare_df))

print(na_per_column)
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0
healthcare_df_clean <- healthcare_df %>%
  mutate(bmi = as.numeric(case_when(bmi == "N/A" ~ NA_real_, TRUE ~ as.numeric(bmi)))) %>%
  mutate(smoking_status = case_when(smoking_status == "Unknown" ~ NA_character_, TRUE ~ smoking_status)) %>%
  mutate(bmi = ifelse(is.na(bmi), 0, bmi)) %>%
  mutate(smoking_status = replace_na(smoking_status, "missing"))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `bmi = as.numeric(case_when(bmi == "N/A" ~ NA_real_, TRUE ~
##   as.numeric(bmi)))`.
## Caused by warning:
## ! NAs introduced by coercion
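The "NAs introduced by coercion" warning above is expected: as.numeric() still sees the literal "N/A" strings inside case_when(). An equivalent, warning-free alternative (a sketch, not run in this report, assuming the same column contents) recodes the placeholder first with dplyr::na_if():

# Sketch: recode "N/A" to NA before coercion, so as.numeric() never warns.
healthcare_df %>%
  mutate(bmi = as.numeric(na_if(bmi, "N/A")))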
healthcare_df_clean <- healthcare_df_clean %>%
  mutate(stroke = factor(stroke, levels = c(1, 0), labels = c("Yes", "No")))

count_unknown_total_clean <- healthcare_df_clean %>%
  summarise(across(everything(), ~ sum(. == "Unknown", na.rm = TRUE)))

count_na_total_clean <- healthcare_df_clean %>%
  summarise(across(everything(), ~ sum(. == "N/A", na.rm = TRUE)))

na_per_column_clean <- colSums(is.na(healthcare_df_clean))

print(na_per_column_clean)
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0
print(count_unknown_total_clean)
##   id gender age hypertension heart_disease ever_married work_type
## 1  0      0   0            0             0            0         0
##   Residence_type avg_glucose_level bmi smoking_status stroke
## 1              0                 0   0              0      0
print(count_na_total_clean)
##   id gender age hypertension heart_disease ever_married work_type
## 1  0      0   0            0             0            0         0
##   Residence_type avg_glucose_level bmi smoking_status stroke
## 1              0                 0   0              0      0
glimpse(healthcare_df_clean)
## Rows: 5,110
## Columns: 12
## $ id                <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension      <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease     <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi               <dbl> 36.6, 0.0, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 0.0, …
## $ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke            <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…

Task Two: Build prediction models

set.seed(42)
data_split <- initial_split(healthcare_df_clean, prop = 0.70, strata = stroke)

train_data <- training(data_split)
test_data <- testing(data_split)

data_folds <- vfold_cv(train_data, v = 10, strata = stroke)

print(data_folds)
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [3219/358]> Fold01
##  2 <split [3219/358]> Fold02
##  3 <split [3219/358]> Fold03
##  4 <split [3219/358]> Fold04
##  5 <split [3219/358]> Fold05
##  6 <split [3219/358]> Fold06
##  7 <split [3219/358]> Fold07
##  8 <split [3220/357]> Fold08
##  9 <split [3220/357]> Fold09
## 10 <split [3220/357]> Fold10
stroke_recipe <- recipe(stroke ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors(), -all_of(c("id"))) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_smote(stroke, over_ratio = 1)

print(stroke_recipe)
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 11
## 
## ── Operations
## • Centering and scaling for: all_numeric_predictors() -all_of(c("id"))
## • Novel factor level assignment for: all_nominal_predictors()
## • Dummy variables from: all_nominal_predictors()
## • SMOTE based on: stroke
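As a quick sanity check (a sketch, not run in this report), the prepared recipe can be baked on its own training data to confirm that step_smote() balances the outcome classes:

# Sketch: inspect the class balance produced by the recipe.
stroke_recipe %>%
  prep() %>%                  # estimate all steps on train_data, including SMOTE
  bake(new_data = NULL) %>%   # NULL returns the processed training set with synthetic rows
  count(stroke)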
log_reg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

stroke_flow <- workflow() %>%
  add_recipe(stroke_recipe) %>%
  add_model(log_reg_spec)

print(stroke_flow)
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_normalize()
## • step_novel()
## • step_dummy()
## • step_smote()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
log_reg_fit_cv <- stroke_flow %>%
  fit_resamples(resamples = data_folds, metrics = metric_set(roc_auc, sensitivity, specificity, accuracy))
## Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
## 1.2.0.
## ℹ See details at
##   <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
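The tidyselect warning above comes from the negated -all_of(c("id")) selection inside step_normalize(). An alternative sketch (not used in this report) gives id an "ID" role up front, which avoids the warning and also keeps the uninformative id column out of the model entirely:

# Sketch: exclude id from modeling via its role instead of a negative selection.
stroke_recipe_alt <- recipe(stroke ~ ., data = train_data) %>%
  update_role(id, new_role = "ID") %>%          # id stays in the data but is not a predictor
  step_normalize(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_smote(stroke, over_ratio = 1)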
collect_metrics(log_reg_fit_cv)
## # A tibble: 4 × 6
##   .metric     .estimator  mean     n std_err .config        
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>          
## 1 accuracy    binary     0.762    10 0.00742 pre0_mod0_post0
## 2 roc_auc     binary     0.840    10 0.00952 pre0_mod0_post0
## 3 sensitivity binary     0.761    10 0.0252  pre0_mod0_post0
## 4 specificity binary     0.763    10 0.00803 pre0_mod0_post0
final_model_fit <- stroke_flow %>%
  fit(data = train_data)

test_predictions <- final_model_fit %>%
  predict(test_data) %>%
  bind_cols(test_data %>% select(stroke))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
conf_mat(test_predictions, truth = stroke, estimate = .pred_class)
##           Truth
## Prediction  Yes   No
##        Yes   64  394
##        No    13 1062
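From the confusion matrix above, the test-set sensitivity and specificity of the logistic regression can be read off directly: sensitivity = 64 / (64 + 13) ≈ 0.83 and specificity = 1062 / (394 + 1062) ≈ 0.73. The same values can be obtained with yardstick (a quick check, not run above; "Yes" is the event level):

# Sensitivity and specificity on the held-out test set.
test_predictions %>% sensitivity(truth = stroke, estimate = .pred_class)
test_predictions %>% specificity(truth = stroke, estimate = .pred_class)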
test_predictions %>% metrics(truth = stroke, estimate = .pred_class)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.735
## 2 kap      binary         0.168

Task Three: Evaluate and select prediction models

eval_metrics <- test_predictions %>%
  metrics(truth = stroke, estimate = .pred_class) %>%
  bind_rows(test_predictions %>% bal_accuracy(truth = stroke, estimate = .pred_class), test_predictions %>% f_meas(truth = stroke, estimate = .pred_class))

print("--- Logistic Regression Metrics ---")
## [1] "--- Logistic Regression Metrics ---"
print(eval_metrics)
## # A tibble: 4 × 3
##   .metric      .estimator .estimate
##   <chr>        <chr>          <dbl>
## 1 accuracy     binary         0.735
## 2 kap          binary         0.168
## 3 bal_accuracy binary         0.780
## 4 f_meas       binary         0.239
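The test-set metrics above are based on hard class predictions only. A probability-based check (a sketch, not run in this report; the object name test_probs is illustrative) would add a test-set ROC AUC for the logistic regression:

# Sketch: ROC AUC on the test set using predicted probabilities.
test_probs <- final_model_fit %>%
  predict(test_data, type = "prob") %>%
  bind_cols(test_data %>% select(stroke))

test_probs %>% roc_auc(truth = stroke, .pred_Yes)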
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger", num.threads = 4) %>%
  set_mode("classification")

rf_wflow <- workflow() %>%
  add_recipe(stroke_recipe) %>%
  add_model(rf_spec)

message("Starting fit_resamples for Random Forest...")
## Starting fit_resamples for Random Forest...
rf_fit_cv <- rf_wflow %>%
  fit_resamples(resamples = data_folds, metrics = metric_set(roc_auc, sensitivity, specificity, accuracy))

print("--- Random Forest (CV) Metrics ---")
## [1] "--- Random Forest (CV) Metrics ---"
collect_metrics(rf_fit_cv)
## # A tibble: 4 × 6
##   .metric     .estimator   mean     n std_err .config        
##   <chr>       <chr>       <dbl> <int>   <dbl> <chr>          
## 1 accuracy    binary     0.947     10 0.00320 pre0_mod0_post0
## 2 roc_auc     binary     0.808     10 0.0100  pre0_mod0_post0
## 3 sensitivity binary     0.0189    10 0.00977 pre0_mod0_post0
## 4 specificity binary     0.994     10 0.00160 pre0_mod0_post0

Task Four: Deploy the prediction model

final_rf_fit <- rf_wflow %>%
  fit(data = train_data)

print(final_rf_fit)
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_normalize()
## • step_novel()
## • step_dummy()
## • step_smote()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,      num.threads = ~4, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      6810 
## Number of independent variables:  21 
## Mtry:                             4 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.02811774
v <- final_rf_fit %>% vetiver_model(model_name = "stroke_prediction_rf")
## Registered S3 method overwritten by 'butcher':
##   method                 from    
##   as.character.dev_topic generics
print(v)
## 
## ── stroke_prediction_rf ─ <bundled_workflow> model for deployment 
## A ranger classification modeling workflow using 11 features
board_path <- "C:/Users/ktss/Desktop/stroke_api/model_board"

board <- board_folder(path = board_path)

board %>% vetiver_pin_write(v)
## Replacing version '20251025T232430Z-a4f60' with '20251025T232803Z-c1f9a'
## Writing to pin 'stroke_prediction_rf'
## 
## Create a Model Card for your published model
## • Model Cards provide a framework for transparent, responsible reporting
## • Use the vetiver `.Rmd` template as a place to start
vetiver_write_plumber(board, "stroke_prediction_rf")
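vetiver_write_plumber() only generates a plumber.R file for the board; serving and querying the model are separate steps. A minimal sketch of what that could look like locally (the port, host, and number of rows queried are assumptions, not part of this report):

# Sketch (not run): serve the pinned model as a local REST API...
library(plumber)

pr() %>%
  vetiver_api(v) %>%          # adds a POST /predict endpoint for the vetiver model
  pr_run(port = 8080)

# ...and query it from another R session.
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")
predict(endpoint, head(test_data %>% select(-stroke), 5))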

Task Five: Findings and Conclusions

This section summarizes the process, the evaluation of the stroke prediction models, and the key learnings from the project.

1. Summary of Process and Data Preprocessing

The project involved a comprehensive machine learning workflow, starting with extensive data preparation.

Key Data Steps:

  • Missing Data Imputation: The bmi column stored “N/A” as text, so it was converted from chr to dbl and the resulting missing values were imputed with 0. The smoking_status column was handled by recoding “Unknown” to an explicit “missing” category so that this information was retained rather than dropped.
  • Feature Engineering: Categorical variables were prepared using a recipe that included step_normalize() for numerical features and step_dummy() for categorical features.
  • Handling Class Imbalance: The binary target variable stroke was heavily imbalanced (very few positive cases). To address this critical issue, the step_smote() function from the themis package was incorporated into the recipe, generating synthetic samples of the minority class to create a more balanced training dataset.

2. Model Evaluation and Selection

Two distinct models were compared using 10-fold cross-validation (fit_resamples): Logistic Regression and Random Forest.

Metric       Logistic Regression (LR) CV   Random Forest (RF) CV
Accuracy     0.762                         0.947
roc_auc      0.840                         0.808
Sensitivity  0.761                         0.0189
Specificity  0.763                         0.994

Key Findings:

  1. High Accuracy vs. Low Sensitivity (Random Forest):
    • The Random Forest model achieved a very high Accuracy (0.947) and near-perfect Specificity (0.994), indicating it is excellent at correctly identifying patients who do not have a stroke (True Negatives).
    • CRITICAL FLAW: However, its Sensitivity was extremely low (0.0189). In a clinical setting, this means the model fails to identify almost all actual stroke cases (high rate of False Negatives). This behavior is likely due to the model overfitting to the majority class (“No Stroke”) or the SMOTE technique being ineffective for this specific algorithm/dataset combination.
  2. Balanced Performance (Logistic Regression):
    • The Logistic Regression model showed a much better balance between Sensitivity (0.761) and Specificity (0.763). Although its overall Accuracy is lower, its balanced approach to classifying both positive and negative cases is clinically more responsible.
    • It also achieved a slightly higher AUC-ROC (0.840), indicating a better overall discrimination capacity between the two classes.

3. Conclusions and Future Work

Conclusion:

Based on the evaluation, the Logistic Regression model is the more reliable choice for an initial production environment, as its ability to correctly identify positive stroke cases (Sensitivity) is significantly better than the Random Forest model. The high performance of the Random Forest on Accuracy is misleading due to its near-zero Sensitivity.

Future Steps:

  • Model Optimization: Focus on tuning the hyperparameters of the Random Forest model to find a better trade-off that increases Sensitivity without completely sacrificing Specificity; a hypothetical tuning sketch follows after this list.
  • Class Imbalance Techniques: Explore alternative strategies to handle the class imbalance, such as class weights or different oversampling/undersampling methods, to ensure the minority class is learned effectively.
  • Implementation: The project successfully demonstrated the deployment phase using the vetiver package to create a Plumber REST API. This implementation allows for real-time predictions and integration into external applications, completing the end-to-end machine learning lifecycle.
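As a starting point for the hyperparameter tuning mentioned above, a hypothetical sketch (not run in this report) could reuse the existing folds and recipe, tuning mtry and min_n while tracking sensitivity alongside ROC AUC; the object names rf_tune_spec, rf_tune_wflow, and rf_tune_res are illustrative only.

# Hypothetical tuning sketch: optimize ROC AUC while monitoring sensitivity.
rf_tune_spec <- rand_forest(trees = 1000, mtry = tune(), min_n = tune()) %>%
  set_engine("ranger", num.threads = 4) %>%
  set_mode("classification")

rf_tune_wflow <- workflow() %>%
  add_recipe(stroke_recipe) %>%
  add_model(rf_tune_spec)

rf_tune_res <- tune_grid(
  rf_tune_wflow,
  resamples = data_folds,
  grid = 20,                                  # a space-filling grid of 20 candidates
  metrics = metric_set(roc_auc, sensitivity, specificity)
)

show_best(rf_tune_res, metric = "roc_auc")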