PyHard: a novel tool for generating hardness embeddings to support data-centric analysis

Paiva, Pedro Yuri Arbs; Smith-Miles, Kate; Valeriano, Maria Gabriela; Lorena, Ana Carolina

arXiv:2109.14430 (cs)

[Submitted on 29 Sep 2021]

Abstract: Building successful Machine Learning (ML) systems requires high-quality data and well-tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on a dataset be revealed? Our new tool, PyHard, employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset, relating the predictive performance of multiple ML models to estimated instance hardness meta-features. This space is built so that observations are arranged linearly according to how hard they are to classify. The user can interact visually with this embedding in multiple ways and obtain useful insights about data and algorithmic performance at the level of individual observations. Using a COVID prognosis dataset, we show how this analysis supported the identification of pockets of hard observations that challenge ML models and are therefore worth closer inspection, and the delineation of regions of strengths and weaknesses of ML models.
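
To make the idea of an instance hardness meta-feature more concrete, the following is a minimal sketch of one measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an observation's k nearest neighbours that carry a different class label. It is an illustration only. The function name `kdn_hardness`, the toy data, and the choice of Euclidean distance are assumptions made for this example, and the code does not use or reproduce PyHard's own API or its ISA projection.

```python
import math


def kdn_hardness(points, labels, k=3):
    """Return, for each instance, the fraction of its k nearest
    neighbours (Euclidean distance) that have a different label."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of instances")
    scores = []
    for i in range(n):
        # Rank every other instance by its distance to instance i.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        disagreeing = sum(labels[j] != labels[i] for j in neighbours)
        scores.append(disagreeing / k)
    return scores


# Hypothetical toy data: two clusters, plus one instance (the last one)
# that sits inside the class-0 cluster but is labelled as class 1.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (0.05, 0.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

hardness = kdn_hardness(points, labels, k=3)
for point, label, score in zip(points, labels, hardness):
    print(point, label, round(score, 2))
```

In this toy example the last instance receives the highest score, because all of its nearest neighbours belong to the other class. Scores of this kind, computed per observation, are the sort of quantity that a hardness embedding relates to the per-instance performance of several classifiers.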

Subjects:	Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Cite as:	arXiv:2109.14430 [cs.LG]
	(or arXiv:2109.14430v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2109.14430

Submission history

From: Pedro Yuri Arbs Paiva
[v1] Wed, 29 Sep 2021 14:08:26 UTC (668 KB)
