Fix: Handle StringDtype and CategoricalDtype columns in classifier training pipeline #153

@zmek

Bug: ValueError: could not convert string to float when training classifiers with non-object string columns

Description

Bug found by @helenajr - original request copied below.

Training a classifier fails with ValueError: could not convert string to float: 'Ambulance' when categorical columns (e.g. arrival_method) use pandas StringDtype ("string") or CategoricalDtype ("category") instead of the legacy "object" dtype.

Root cause

create_column_transformer uses df[col].dtype == "object" to detect columns that need one-hot encoding. This check misses two other common string-like dtypes:

  • StringDtype — used by default in some newer pandas configurations, or when users call .astype("string")
  • CategoricalDtype — common when users optimise memory with .astype("category")

When the check fails, the column falls through to the StandardScaler() branch, which attempts to convert strings to floats.

The same narrow check exists in FeatureColumnTransformer.fit for computing per-column default values.
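The gap in the narrow check can be seen in isolation. The snippet below is a minimal sketch, not code from the package: it builds the same values under the three dtypes and applies the dtype == "object" comparison described above. The sample values other than 'Ambulance' and the variable names are made up for illustration.

```python
import pandas as pd

# Illustrative values; only 'Ambulance' appears in the original report.
values = ["Ambulance", "Walk-in", "Ambulance"]

# The same data stored under the three string-like dtypes discussed above.
columns = {
    "object": pd.Series(values, dtype="object"),
    "string": pd.Series(values, dtype="string"),
    "category": pd.Series(values, dtype="category"),
}

# Apply the narrow check to each column.
narrow_check = {name: bool(s.dtype == "object") for name, s in columns.items()}
print(narrow_check)
# {'object': True, 'string': False, 'category': False}
```

Only the legacy object column passes the check, so the other two would be routed to the scaling branch.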

Traceback

ValueError                                Traceback (most recent call last)
     9 model = train_classifier(
    10     train_visits=train_visits,
    11     valid_visits=valid_visits,
    12     test_visits=test_visits,
    ...
--> 718 cv_results = chronological_cross_validation(
    719     pipeline, X_train, y_train, n_splits=5
    720 )

    231 for train_idx, valid_idx in tscv.split(X):
    232     X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]

ValueError: could not convert string to float: 'Ambulance'

Fix

Replace the dtype == "object" checks with a helper that covers all three string-like dtypes:

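import pandas as pd

def _is_string_like_column(series: pd.Series) -> bool:
    """Return True for columns that need one-hot encoding."""
    # For categoricals, inspect the dtype of the categories themselves.
    if isinstance(series.dtype, pd.CategoricalDtype):
        series = series.cat.categories.to_series()
    # Anything that is not numeric, boolean, or datetime counts as string-like.
    return not (
        pd.api.types.is_numeric_dtype(series)
        or pd.api.types.is_bool_dtype(series)
        or pd.api.types.is_datetime64_any_dtype(series)
    )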

Files changed:

  • src/patientflow/train/classifiers.py — added _is_string_like_column helper; updated create_column_transformer and FeatureColumnTransformer.fit
  • tests/test_classifiers.py — added test_string_dtype_columns and test_categorical_dtype_columns

Original issue raised privately by @helenajr

@zmek hello! I was previously working with a version of the package installed from the fix-modelling-issues branch and everything worked fine. I've now switched to using the released version of the package (so I can deploy it on the server) and this chunk (train classifier) has stopped working. I get the same error whether I'm using the latest release or v1.1.2

Train the models

prediction_times = filtered_snapshots['prediction_time'].unique().tolist() 
display(Markdown(f'**Prediction time list:** {prediction_times}')) 
display(Markdown(f'**Iterating through prediction times and training model for each time.....**')) 
 
trained_models = {} 
for prediction_time in prediction_times: 
 
    # train model 
    model = train_classifier( 
        train_visits=train_visits, 
        valid_visits=valid_visits, 
        test_visits=test_visits, 
        grid={ 
            "n_estimators": [5, 8, 10, 20, 30] 
            }, 
        prediction_time=prediction_time, 
        exclude_from_training_data=exclude_from_training_data, 
        ordinal_mappings={'acuity': [1, 2, 3, 4, 5]}, 
        single_snapshot_per_visit=True, 
        visit_col='ENCNTR_ID', # as we are using a single snapshot per visit, we need to specify which column contains the visit number 
        use_balanced_training=True, 
        calibrate_probabilities=True, 
        calibration_method='sigmoid', 
        evaluate_on_test=True, # by default, this is set to False; only evaluate on the test set when happy with validation set  
        label_col = outcome_variable 
    ) 
 
    # print time 
The error I get is:

ValueError                                Traceback (most recent call last)
in ?()
      5 trained_models = {}
      6 for prediction_time in prediction_times:
      7
      8     # train model
----> 9     model = train_classifier(
     10         train_visits=train_visits,
     11         valid_visits=valid_visits,
     12         test_visits=test_visits,

ValueError: could not convert string to float: 'Ambulance'
