Fix: Handle StringDtype and CategoricalDtype columns in classifier training pipeline #153

## Bug: `ValueError: could not convert string to float` when training classifiers with non-object string columns

### Description

Bug found by @helenajr; the original report is copied below.

Training a classifier fails with `ValueError: could not convert string to float: 'Ambulance'` when categorical columns (e.g. `arrival_method`) use pandas `StringDtype` (`"string"`) or `CategoricalDtype` (`"category"`) instead of the legacy `"object"` dtype.

### Root cause

`create_column_transformer` uses `df[col].dtype == "object"` to detect columns that need one-hot encoding. This check misses two other common string-like dtypes:

- **`StringDtype`** — used by default in some newer pandas configurations, or when users call `.astype("string")`
- **`CategoricalDtype`** — common when users optimise memory with `.astype("category")`

When the check fails, the column falls through to the `StandardScaler()` branch, which attempts to convert strings to floats.

The same narrow check exists in `FeatureColumnTransformer.fit` for computing per-column default values.
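
A minimal sketch of the failure mode, using only pandas and NumPy rather than the package itself. The sample values other than `'Ambulance'` are made up for illustration, and the NumPy cast is a stand-in for the float conversion that the `StandardScaler()` branch attempts, not the actual scikit-learn call.

```python
import numpy as np
import pandas as pd

values = ["Ambulance", "Walk-in", "Ambulance"]  # illustrative data

# The same values stored under the three string-like dtypes.
columns = {
    "object": pd.Series(values, dtype="object"),
    "string": pd.Series(values, dtype="string"),
    "category": pd.Series(values, dtype="category"),
}

# The narrow check: compare each column's dtype against "object".
narrow_check = {name: bool(col.dtype == "object") for name, col in columns.items()}
print(narrow_check)

# Stand-in for what happens to a string column that is routed to the scaler:
# a cast to float, which cannot succeed for text values.
try:
    np.asarray(columns["string"].to_numpy(dtype=object), dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Only the `object` entry should pass the narrow check, so the `string` and `category` columns would be treated as numeric and reach the float conversion.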

### Traceback

```
ValueError                                Traceback (most recent call last)
     9 model = train_classifier(
    10     train_visits=train_visits,
    11     valid_visits=valid_visits,
    12     test_visits=test_visits,
    ...
--> 718 cv_results = chronological_cross_validation(
    719     pipeline, X_train, y_train, n_splits=5
    720 )

    231 for train_idx, valid_idx in tscv.split(X):
    232     X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]

ValueError: could not convert string to float: 'Ambulance'
```

### Fix

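Replace the `dtype == "object"` checks with a helper that treats any column that is not numeric, boolean, or datetime as string-like. This covers all three string-like dtypes; for categorical columns, the helper inspects the dtype of the categories rather than the column itself: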

```python
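import pandas as pd


def _is_string_like_column(series: pd.Series) -> bool:
    """Return True if the column should be one-hot encoded rather than scaled."""
    # For categoricals, classify by the dtype of the underlying categories.
    if isinstance(series.dtype, pd.CategoricalDtype):
        series = series.cat.categories.to_series()
    # Anything that is not numeric, boolean, or datetime counts as string-like.
    return not (
        pd.api.types.is_numeric_dtype(series)
        or pd.api.types.is_bool_dtype(series)
        or pd.api.types.is_datetime64_any_dtype(series)
    )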
```

**Files changed:**
- `src/patientflow/train/classifiers.py` — added `_is_string_like_column` helper; updated `create_column_transformer` and `FeatureColumnTransformer.fit`
- `tests/test_classifiers.py` — added `test_string_dtype_columns` and `test_categorical_dtype_columns`

## Original issue raised privately by @helenajr 

> [@zmek](https://github.com/zmek) hello! I was previously working with a version of the package installed from the fix-modelling-issues branch and everything worked fine. I've now switched to using the released version of the package (so I can deploy it on the server) and this chunk (train classifier) has stopped working. I get the same error whether I'm using the latest release or v1.1.2
> 
>  ## Train the models 
>  ```{python} 
>  prediction_times = filtered_snapshots['prediction_time'].unique().tolist() 
>  display(Markdown(f'**Prediction time list:** {prediction_times}')) 
>  display(Markdown(f'**Iterating through prediction times and training model for each time.....**')) 
>   
>  trained_models = {} 
>  for prediction_time in prediction_times: 
>   
>      # train model 
>      model = train_classifier( 
>          train_visits=train_visits, 
>          valid_visits=valid_visits, 
>          test_visits=test_visits, 
>          grid={ 
>              "n_estimators": [5, 8, 10, 20, 30] 
>              }, 
>          prediction_time=prediction_time, 
>          exclude_from_training_data=exclude_from_training_data, 
>          ordinal_mappings={'acuity': [1, 2, 3, 4, 5]}, 
>          single_snapshot_per_visit=True, 
>          visit_col='ENCNTR_ID', # as we are using a single snapshot per visit, we need to specify which column contains the visit number 
>          use_balanced_training=True, 
>          calibrate_probabilities=True, 
>          calibration_method='sigmoid', 
>          evaluate_on_test=True, # by default, this is set to False; only evaluate on the test set when happy with validation set  
>          label_col = outcome_variable 
>      ) 
>   
>      # print time 
>  ```
> 
> The error I get is:
> 
>     ValueError                                Traceback (most recent call last)
>     in ?()
>           5 trained_models = {}
>           6 for prediction_time in prediction_times:
>           7
>           8     # train model
>     ----> 9     model = train_classifier(
>          10         train_visits=train_visits,
>          11         valid_visits=valid_visits,
>          12         test_visits=test_visits,
>
>     ValueError: could not convert string to float: 'Ambulance'


