Bug: ValueError: could not convert string to float when training classifiers with non-object string columns
Description
Bug found by @helenajr - original request copied below.
Training a classifier fails with ValueError: could not convert string to float: 'Ambulance' when categorical columns (e.g. arrival_method) use pandas StringDtype ("string") or CategoricalDtype ("category") instead of the legacy "object" dtype.
Root cause
create_column_transformer uses df[col].dtype == "object" to detect columns that need one-hot encoding. This check misses two other common string-like dtypes:
- StringDtype: used by default in some newer pandas configurations, or when users call .astype("string")
- CategoricalDtype: common when users optimise memory with .astype("category")
When the check fails, the column falls through to the StandardScaler() branch, which attempts to convert strings to floats.
The same narrow check exists in FeatureColumnTransformer.fit for computing per-column default values.
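The narrow check can be reproduced in isolation. The snippet below is a minimal sketch, not code from the package: the sample values and the `columns` dictionary are made up for illustration, and only the `dtype == "object"` comparison mirrors the check described above.

```python
import pandas as pd

# Hypothetical sample data standing in for a column such as arrival_method.
values = ["Ambulance", "Walk-in", "Ambulance"]

# The same values stored under the three string-like dtypes.
columns = {
    "object": pd.Series(values, dtype="object"),
    "string": pd.Series(values, dtype="string"),
    "category": pd.Series(values, dtype="category"),
}

# The comparison described in the root cause: only the legacy object dtype passes.
narrow_check = {name: bool(s.dtype == "object") for name, s in columns.items()}
print(narrow_check)
# {'object': True, 'string': False, 'category': False}
```

Any column for which this comparison is False would be routed to the numeric branch, which is where the string-to-float conversion fails.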
Traceback
ValueError Traceback (most recent call last)
9 model = train_classifier(
10 train_visits=train_visits,
11 valid_visits=valid_visits,
12 test_visits=test_visits,
...
--> 718 cv_results = chronological_cross_validation(
719 pipeline, X_train, y_train, n_splits=5
720 )
231 for train_idx, valid_idx in tscv.split(X):
232 X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
ValueError: could not convert string to float: 'Ambulance'
Fix
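Replace the dtype == "object" checks with a helper that treats any column that is not numeric, boolean or datetime as string-like. This covers all three string-like dtypes; a categorical column is judged by the dtype of its categories: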
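import pandas as pd
from pandas import Series


def _is_string_like_column(series: Series) -> bool:
    # For categoricals, inspect the dtype of the categories rather than the codes.
    if isinstance(series.dtype, pd.CategoricalDtype):
        series = series.cat.categories.to_series()
    # Anything that is not numeric, boolean or datetime is treated as string-like.
    return not (
        pd.api.types.is_numeric_dtype(series)
        or pd.api.types.is_bool_dtype(series)
        or pd.api.types.is_datetime64_any_dtype(series)
    )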
Files changed:
- src/patientflow/train/classifiers.py: added _is_string_like_column helper; updated create_column_transformer and FeatureColumnTransformer.fit
- tests/test_classifiers.py: added test_string_dtype_columns and test_categorical_dtype_columns
Original issue raised privately by @helenajr
@zmek hello! I was previously working with a version of the package installed from the fix-modelling-issues branch and everything worked fine. I've now switched to using the released version of the package (so I can deploy it on the server) and this chunk (train classifier) has stopped working. I get the same error whether I'm using the latest release or v1.1.2
Train the models
prediction_times = filtered_snapshots['prediction_time'].unique().tolist()
display(Markdown(f'**Prediction time list:** {prediction_times}'))
display(Markdown(f'**Iterating through prediction times and training model for each time.....**'))
trained_models = {}
for prediction_time in prediction_times:

    # train model
    model = train_classifier(
        train_visits=train_visits,
        valid_visits=valid_visits,
        test_visits=test_visits,
        grid={
            "n_estimators": [5, 8, 10, 20, 30]
        },
        prediction_time=prediction_time,
        exclude_from_training_data=exclude_from_training_data,
        ordinal_mappings={'acuity': [1, 2, 3, 4, 5]},
        single_snapshot_per_visit=True,
        visit_col='ENCNTR_ID',  # as we are using a single snapshot per visit, we need to specify which column contains the visit number
        use_balanced_training=True,
        calibrate_probabilities=True,
        calibration_method='sigmoid',
        evaluate_on_test=True,  # by default, this is set to False; only evaluate on the test set when happy with validation set
        label_col=outcome_variable
    )
    # print time
The error I get is:
ValueError Traceback (most recent call last)
in ?()
5 trained_models = {}
6 for prediction_time in prediction_times:
7
8 # train model
----> 9 model = train_classifier(
10 train_visits=train_visits,
11 valid_visits=valid_visits,
12 test_visits=test_visits,
ValueError: could not convert string to float: 'Ambulance'