Predicting Labels
Module for training and applying a text classification model.
This class streamlines fine-tuning a transformer-based classifier on labeled data and applying the trained model to annotate new, unlabeled datasets. It supports both single- and multi-column predictions and offers optional model saving and evaluation output.
Attributes:
Name | Type | Description |
---|---|---|
model_name | str | Name of the pretrained Hugging Face model to fine-tune (default: "distilbert-base-uncased"). |
Methods:
Name | Description |
---|---|
run_pipeline | Trains the classifier and returns a DataFrame with predicted labels and confidence scores. |
Source code in src/educhateval/core.py
run_pipeline(train_data, new_data, text_column='text', label_column='category', columns_to_classify=None, split_ratio=0.2, training_params=[0.01, 'cross_entropy', 5e-05, 8, 8, 4, 0.01], tuning=False, tuning_params=None, model_save_path=None, prediction_save_path=None, seed=42)
This function handles the full pipeline of loading data, preparing datasets, tokenizing inputs, training a transformer-based classifier, and applying it to specified text columns in new data. It supports custom hyperparameters, optional hyperparameter tuning, and saving of both the trained model and prediction outputs.
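The dataset-preparation step holds out part of the labeled data for validation, controlled by `split_ratio` and `seed` in the signature above. The following is a minimal sketch of such a seeded hold-out split in plain Python. The helper name `split_rows` is hypothetical, and treating the bounds of `split_ratio` as exclusive is an assumption; the library's own splitting logic is not shown on this page and may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_rows(
    rows: Sequence[T], split_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: shuffle with a fixed seed, then hold out a validation slice."""
    # Assumption: the bounds are exclusive, so neither split can be empty by design.
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * split_ratio))
    return shuffled[n_val:], shuffled[:n_val]


# Illustrative data only: ten (text, label) pairs.
rows = [(f"example {i}", "label") for i in range(10)]
train_rows, val_rows = split_rows(rows, split_ratio=0.2, seed=42)
print(len(train_rows), len(val_rows))
```

Reusing the same `seed` yields the same split, which is what makes repeated runs comparable.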
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_data | Union[str, DataFrame] | Labeled dataset for training. Can be a DataFrame or a CSV file path. | required |
new_data | Union[str, DataFrame] | Dataset to annotate with predicted labels. Can be a DataFrame or a CSV file path. | required |
text_column | str | Column in the training data containing the input text. Defaults to "text". | 'text' |
label_column | str | Column in the training data containing the target labels. Defaults to "category". | 'category' |
columns_to_classify | Optional[Union[str, List[str]]] | Column(s) in | None |
split_ratio | float | Ratio of data to use for validation. Must be between 0 and 1. Defaults to 0.2. | 0.2 |
training_params | list | List of 7 training hyperparameters: [weight_decay, loss_fn, learning_rate, batch_size, num_epochs, warmup_steps, gradient_accumulation]. Defaults to [0.01, "cross_entropy", 5e-5, 8, 8, 4, 0.01]. | [0.01, 'cross_entropy', 5e-05, 8, 8, 4, 0.01] |
tuning | bool | Whether to perform hyperparameter tuning. Defaults to False. | False |
tuning_params | Optional[dict] | Dictionary of tuning settings if | None |
model_save_path | Optional[str] | Optional path to save the trained model and tokenizer. Defaults to None. | None |
prediction_save_path | Optional[str] | Optional path to save annotated predictions as a CSV. Defaults to None. | None |
seed | int | Random seed for reproducibility. Defaults to 42. | 42 |
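Because `training_params` is a positional list, it is easy to put a value in the wrong slot. The sketch below pairs each value with the name given in the parameter description, in the documented order, and rejects lists of the wrong length. The helper `name_training_params` is hypothetical and not part of the library; it only illustrates how the seven positions line up.

```python
from typing import Any, Dict, List

# Names copied from the parameter description, in the documented order.
TRAINING_PARAM_NAMES = (
    "weight_decay",
    "loss_fn",
    "learning_rate",
    "batch_size",
    "num_epochs",
    "warmup_steps",
    "gradient_accumulation",
)


def name_training_params(training_params: List[Any]) -> Dict[str, Any]:
    """Hypothetical helper: map each positional value to its documented name."""
    if len(training_params) != len(TRAINING_PARAM_NAMES):
        raise ValueError(
            f"expected {len(TRAINING_PARAM_NAMES)} values, got {len(training_params)}"
        )
    return dict(zip(TRAINING_PARAM_NAMES, training_params))


# The documented default list, labeled position by position.
defaults = name_training_params([0.01, "cross_entropy", 5e-05, 8, 8, 4, 0.01])
for name, value in defaults.items():
    print(f"{name} = {value}")
```

Printing the labeled defaults before overriding any of them is a cheap way to confirm that a custom list changes the intended hyperparameter.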
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame containing the original |
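`run_pipeline` is described as returning predicted labels together with confidence scores. This page does not say how those scores are computed; one common convention for classifiers, assumed here purely for illustration, is to apply a softmax to the model's output logits and report the highest probability alongside its label. The function names, logits, and label names below are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(
    logits: Sequence[float], labels: Sequence[str]
) -> Tuple[str, float]:
    """Return the most probable label and its probability (assumed convention)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits and label names, for illustration only.
label, confidence = predict_with_confidence(
    [2.0, 0.5, -1.0], ["question", "statement", "other"]
)
print(label, round(confidence, 3))
```

Under this convention a score near 1 means the model strongly favors one class, while a score near 1 divided by the number of classes means it is close to guessing.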