The exact probability law for the approximated similarity from the Minhashing method

Dembele, Soumaila; Lo, Gane Samb

doi:10.16929/as/2017.1199.100

arXiv:2209.10031 (math)

[Submitted on 20 Sep 2022 (v1), last revised 25 Sep 2022 (this version, v2)]

Abstract: We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman (RU) algorithm and of a modified version of it, denoted RUM. These algorithms aim to estimate the similarity index between very large texts in the context of the web. We give a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the RUM algorithm. Some extensions are given.
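The central claim, that the exact similarity is the expectation of the random similarity, can be illustrated with a minimal sketch of the textbook minhash estimator of the Jaccard similarity. This is not the paper's RU or RUM algorithm: it assumes uniformly random permutations of the combined vocabulary, and the names `jaccard`, `minhash_similarity`, `doc_a` and `doc_b` are hypothetical, chosen only for this illustration.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def minhash_similarity(a, b, num_perm, rng):
    """Estimate the Jaccard similarity with num_perm random permutations.

    Assumption: each permutation is drawn uniformly at random over the
    union of the two sets. Under that assumption the two minima coincide
    with probability equal to the exact Jaccard similarity, so the
    returned fraction is an unbiased estimate of it.
    """
    a, b = set(a), set(b)
    universe = sorted(a | b)
    matches = 0
    for _ in range(num_perm):
        # Draw a random ranking of the universe (one "hash function").
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        rank = dict(zip(universe, ranks))
        # The minhash signature of a set is the smallest rank it contains.
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            matches += 1
    return matches / num_perm


rng = random.Random(0)  # fixed seed so the run is reproducible
doc_a = {"the", "exact", "law", "of", "minhash"}
doc_b = {"the", "law", "of", "large", "numbers"}

exact = jaccard(doc_a, doc_b)  # 3 shared words out of 7 distinct words
estimate = minhash_similarity(doc_a, doc_b, 2000, rng)
print(f"exact = {exact:.4f}, minhash estimate = {estimate:.4f}")
```

With more permutations the estimate concentrates around the exact value; the sketch does not handle empty sets, and real implementations replace explicit permutations with hash functions.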

Subjects:	Probability (math.PR); Applications (stat.AP)
MSC classes:	62E15, 62F12, 68R05, 68R15, 68Q97
Cite as:	arXiv:2209.10031 [math.PR]
	(or arXiv:2209.10031v2 [math.PR] for this version)
	https://doi.org/10.48550/arXiv.2209.10031
Journal reference:	Afrika Statistika Vol. 12, Issue 1 (Apr 2017), pg(s) 1199-1218
Related DOI:	https://doi.org/10.16929/as/2017.1199.100

Submission history

From: Gane Samb Lo
[v1] Tue, 20 Sep 2022 22:43:12 UTC (17 KB)
[v2] Sun, 25 Sep 2022 17:38:33 UTC (17 KB)
