Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification

Zhang, Chihao; Chen, Yiling Elaine; Zhang, Shihua; Li, Jingyi Jessica

Statistics > Machine Learning

arXiv:2109.00582 (stat)

[Submitted on 1 Sep 2021 (v1), last revised 2 Jul 2022 (this version, v3)]

Abstract: Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, no principled approach exists to guide the label combination for all data points by an optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. The Python package itca (available at this https URL) implements ITCA and the search strategies.
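The greedy search mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the ITCA formula, so the `score` function below is an assumption made for illustration: it weights each combined class's accuracy by that class's entropy term, so that merging labels raises accuracy but lowers resolution. As a further simplification, predictions are fixed and mapped through each candidate combination rather than retraining a classifier. All names here (`score`, `greedy_search`, the toy data) are hypothetical and are not the API of the itca package.

```python
import math
from itertools import combinations


def score(partition, y_true, y_pred):
    """Illustrative criterion (assumed form): entropy-weighted class-wise accuracy."""
    # Map every original label to the index of the combined class containing it.
    group_of = {label: g for g, labels in enumerate(partition) for label in labels}
    n = len(y_true)
    total = 0.0
    for g in range(len(partition)):
        idx = [i for i in range(n) if group_of[y_true[i]] == g]
        if not idx:
            continue
        p = len(idx) / n  # empirical proportion of combined class g
        acc = sum(group_of[y_pred[i]] == g for i in idx) / len(idx)
        total += -p * math.log(p) * acc  # resolution term times accuracy term
    return total


def greedy_search(labels, y_true, y_pred):
    """Merge the best pair of classes at each step until the score stops improving."""
    partition = [frozenset([label]) for label in labels]
    best = score(partition, y_true, y_pred)
    while len(partition) > 1:
        candidates = []
        for a, b in combinations(range(len(partition)), 2):
            merged = [s for k, s in enumerate(partition) if k not in (a, b)]
            merged.append(partition[a] | partition[b])
            candidates.append((score(merged, y_true, y_pred), merged))
        top_score, top_partition = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no merge improves the criterion
        best, partition = top_score, top_partition
    return partition, best


# Toy data (made up): labels 0 and 1 are confused with each other,
# while labels 2, 3, and 4 are always predicted correctly.
labels = [0, 1, 2, 3, 4]
y_true = [label for label in labels for _ in range(10)]
y_pred = [0, 1] * 10 + [label for label in (2, 3, 4) for _ in range(10)]

initial = score([frozenset([label]) for label in labels], y_true, y_pred)
partition, best = greedy_search(labels, y_true, y_pred)
print(partition, best)
```

On this toy data the search combines labels 0 and 1 and then stops, because every further merge would cost more in resolution than it gains in accuracy. A breadth-first variant would explore several candidate combinations per level instead of committing to the single best merge.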

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
MSC classes:	62-08
Cite as:	arXiv:2109.00582 [stat.ML]
	(or arXiv:2109.00582v3 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2109.00582

Submission history

From: Jingyi Jessica Li
[v1] Wed, 1 Sep 2021 19:20:28 UTC (9,811 KB)
[v2] Fri, 17 Sep 2021 17:56:20 UTC (13,891 KB)
[v3] Sat, 2 Jul 2022 05:44:29 UTC (9,416 KB)
