TY - JOUR
T1 - Limitations of Influence-Based Dataset Compression for Waste Classification
AU - Aberger, Julian
AU - Brensberger, Lena
AU - Koinig, Gerald
AU - Häcker, Benedikt
AU - Pestana, Jesus
AU - Sarc, Renato
PY - 2025/8/7
Y1 - 2025/8/7
N2 - Influence-based data selection methods, such as TracIn, aim to estimate the impact of individual training samples on model predictions and are increasingly used for dataset curation and reduction. This study investigates whether selecting the most positively influential training examples can be used to create compressed yet effective training datasets for transfer learning in plastic waste classification. Using a ResNet-18 model trained on a custom dataset of plastic waste images, TracIn was applied to compute influence scores across multiple training checkpoints. The top 50 influential samples per class were extracted and used to train a new model. Contrary to expectations, models trained on these highly influential subsets significantly underperformed compared to models trained on either the full dataset or an equally sized random sample. Further analysis revealed that many top-ranked influential images originated from different classes, indicating model biases and potential label confusion. These findings highlight the limitations of using influence scores for dataset compression. However, TracIn proved valuable for identifying problematic or ambiguous samples, class imbalance, and fuzzy class boundaries. Based on these results, the TracIn approach is recommended as a diagnostic instrument rather than as a tool for dataset curation.
AB - Influence-based data selection methods, such as TracIn, aim to estimate the impact of individual training samples on model predictions and are increasingly used for dataset curation and reduction. This study investigates whether selecting the most positively influential training examples can be used to create compressed yet effective training datasets for transfer learning in plastic waste classification. Using a ResNet-18 model trained on a custom dataset of plastic waste images, TracIn was applied to compute influence scores across multiple training checkpoints. The top 50 influential samples per class were extracted and used to train a new model. Contrary to expectations, models trained on these highly influential subsets significantly underperformed compared to models trained on either the full dataset or an equally sized random sample. Further analysis revealed that many top-ranked influential images originated from different classes, indicating model biases and potential label confusion. These findings highlight the limitations of using influence scores for dataset compression. However, TracIn proved valuable for identifying problematic or ambiguous samples, class imbalance, and fuzzy class boundaries. Based on these results, the TracIn approach is recommended as a diagnostic instrument rather than as a tool for dataset curation.
UR - https://doi.org/10.3390/data10080127
U2 - 10.3390/data10080127
DO - 10.3390/data10080127
M3 - Article
VL - 10
JO - Data
JF - Data
IS - 8
M1 - 127
ER -