Methodology
Showing new listings for Thursday, 28 November 2024
- [1] arXiv:2411.17841 [pdf, html, other]
Title: Defective regression models for cure rate modeling in Marshall-Olkin family
Subjects: Methodology (stat.ME); Applications (stat.AP)
Regression models have a substantial impact on the interpretation of treatments, genetic characteristics, and other covariates in survival analysis. In many datasets, the pattern of censoring and the shape of the survival curve reveal the presence of a cure fraction in the data, which calls for alternative modelling. The most common approaches for introducing covariates under parametric estimation are the cure rate models and their variations, although the use of defective distributions provides a more parsimonious and integrated alternative. A defective distribution is given by a density function that integrates to less than one after changing the domain of one of its parameters. In this work, we introduce two new defective regression models for long-term survival data in the Marshall-Olkin family: the Marshall-Olkin Gompertz and the Marshall-Olkin inverse Gaussian. Estimation is conducted using maximum likelihood and Bayesian inference. We evaluate the asymptotic properties of the classical approach in Monte Carlo studies, as well as the behavior of Bayes estimates under vague prior information. Both models are applied, under classical and Bayesian inference, to an experiment on time until death from colon cancer with a dichotomous covariate. The Marshall-Olkin Gompertz regression provided the best fit, and we present global diagnostics and residual analysis for this proposal.
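As a point of reference for the defective-distribution idea (a textbook example consistent with the abstract, not excerpted from the paper), the Gompertz model with a negative shape parameter already yields a cure fraction:

```latex
% Defective Gompertz (standard example): survival S(t) = exp{-(b/a)(e^{at}-1)}, b > 0.
% For a < 0, e^{at} -> 0 as t -> infinity, so the density integrates to less than one:
\[
\lim_{t\to\infty} S(t) = e^{b/a} \in (0,1), \qquad
\int_0^\infty f(t)\,dt = 1 - e^{b/a} < 1, \qquad
\text{cure fraction } p = e^{b/a}.
\]
```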
- [2] arXiv:2411.17859 [pdf, html, other]
Title: Sparse two-block dimension reduction for simultaneous compression and variable selection in two blocks of variables
Subjects: Methodology (stat.ME); Computation (stat.CO)
A method is introduced to perform simultaneous sparse dimension reduction on two blocks of variables. Beyond dimension reduction, it also yields an estimator for multivariate regression with the capability to intrinsically deselect uninformative variables in both independent and dependent blocks. An algorithm is provided that leads to a straightforward implementation of the method. The benefits of simultaneous sparse dimension reduction are shown to carry through to enhanced capability to predict a set of multivariate dependent variables jointly. Both in a simulation study and in two chemometric applications, the new method outperforms its dense counterpart, as well as multivariate partial least squares.
- [3] arXiv:2411.17905 [pdf, html, other]
Title: Repeated sampling of different individuals but the same clusters to improve precision of difference-in-differences estimators: the DISC design
Subjects: Methodology (stat.ME)
We describe the DISC (Different Individuals, Same Clusters) design, a sampling scheme that can improve the precision of difference-in-differences (DID) estimators in settings involving repeated sampling of a population at multiple time points. Although cohort designs typically lead to more efficient DID estimators than repeated cross-sectional (RCS) designs, they are often impractical due to high rates of loss to follow-up, individuals leaving the risk set, or other reasons. The DISC design is a hybrid between a cohort sampling design and an RCS sampling design: the researcher takes a single sample of clusters, but then draws different cross-sectional samples of individuals within each cluster at two or more time points. We show that the DISC design can yield DID estimators with much higher precision than an RCS design, particularly if random cluster effects are present in the data-generating mechanism. For example, for a design in which 40 clusters and 25 individuals per cluster are sampled (for a total sample size of n=1,000), the variance of a commonly used DID treatment effect estimator is 2.3 times higher in the RCS design for an intraclass correlation coefficient (ICC) of 0.05, 3.8 times higher for an ICC of 0.1, and 7.3 times higher for an ICC of 0.2.
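To see where variance ratios of this kind come from, here is a minimal Monte Carlo sketch (my own illustration, not the authors' code): it compares a simple pre/post mean difference when the same clusters are reused at both time points versus freshly sampled clusters, under a random-cluster-effects model with total variance one. Under these simplifying assumptions the ratio is roughly 1 + m * ICC / (1 - ICC) for m individuals per cluster, which matches the quoted 2.3, 3.8, and 7.3 figures for m = 25 up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

def var_pre_post_diff(icc, same_clusters, n_clusters=40, m=25, reps=4000):
    """Monte Carlo variance of a pre/post difference in means under a
    random-cluster-effects model with total variance one."""
    sd_u = np.sqrt(icc)        # cluster-effect SD
    sd_e = np.sqrt(1 - icc)    # within-cluster SD
    diffs = np.empty(reps)
    for r in range(reps):
        u1 = rng.normal(0.0, sd_u, n_clusters)
        u2 = u1 if same_clusters else rng.normal(0.0, sd_u, n_clusters)
        y1 = u1[:, None] + rng.normal(0.0, sd_e, (n_clusters, m))
        y2 = u2[:, None] + rng.normal(0.0, sd_e, (n_clusters, m))
        diffs[r] = y2.mean() - y1.mean()
    return diffs.var()

for icc in (0.05, 0.10, 0.20):
    ratio = var_pre_post_diff(icc, False) / var_pre_post_diff(icc, True)
    print(f"ICC={icc:.2f}: Var(fresh clusters) / Var(same clusters) ~ {ratio:.1f}")
```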
- [4] arXiv:2411.17910 [pdf, html, other]
Title: Bayesian Variable Selection for High-Dimensional Mediation Analysis: Application to Metabolomics Data in Epidemiological Studies
Subjects: Methodology (stat.ME); Applications (stat.AP)
In epidemiological research, causal models incorporating potential mediators along a pathway are crucial for understanding how exposures influence health outcomes. This work is motivated by integrated epidemiological and blood biomarker studies, investigating the relationship between long-term adherence to a Mediterranean diet and cardiometabolic health, with plasma metabolomes as potential mediators. Analyzing causal mediation in such high-dimensional omics data presents substantial challenges, including complex dependencies among mediators and the need for advanced regularization or Bayesian techniques to ensure stable and interpretable estimation and selection of indirect effects. To this end, we propose a novel Bayesian framework for identifying active pathways and estimating indirect effects in the presence of high-dimensional multivariate mediators. Our approach adopts a multivariate stochastic search variable selection method, tailored for such complex mediation scenarios. Central to our method is the introduction of a set of priors for the selection: a Markov random field prior and sequential subsetting Bernoulli priors. The first prior's Markov property leverages the inherent correlations among mediators, thereby increasing power to detect mediated effects. The sequential subsetting aspect of the second prior encourages the simultaneous selection of relevant mediators and their corresponding indirect effects from the two model parts, providing a more coherent and efficient variable selection framework specific to mediation analysis. Comprehensive simulation studies demonstrate that the proposed method provides superior power in detecting active mediating pathways. We further illustrate the practical utility of the method through its application to metabolome data from two cohort studies, highlighting its effectiveness in real data settings.
- [5] arXiv:2411.17983 [pdf, html, other]
Title: Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select "interesting" instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite samples. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting.
This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and alignment of large language models in radiology report generation.
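For readers unfamiliar with the baseline that OptCS builds on, the sketch below is a rough schematic of split-sample conformal selection (my simplified rendering, assuming a single pre-fitted score where larger values indicate larger labels and a null of Y <= c); it is not the OptCS procedure itself, whose p-value construction and multiple testing step differ.

```python
import numpy as np

def conformal_select(cal_scores, cal_labels, test_scores, c=0.0, alpha=0.1):
    """Schematic split-conformal selection: flag test points likely to have Y > c
    using conformal p-values built from calibration points with Y <= c, then
    apply Benjamini-Hochberg to the p-values."""
    null_scores = cal_scores[cal_labels <= c]          # calibration "nulls"
    n0 = len(null_scores)
    pvals = np.array([(1 + np.sum(null_scores >= s)) / (n0 + 1) for s in test_scores])
    # Benjamini-Hochberg step-up
    order = np.argsort(pvals)
    m = len(pvals)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = np.where(pvals[order] <= thresh)[0]
    k = passed.max() + 1 if passed.size else 0
    return order[:k], pvals

# toy usage with a noisy score that tracks the label
rng = np.random.default_rng(1)
cal_y = rng.normal(size=200); cal_s = cal_y + rng.normal(scale=0.5, size=200)
test_y = rng.normal(size=50); test_s = test_y + rng.normal(scale=0.5, size=50)
selected, p = conformal_select(cal_s, cal_y, test_s, c=0.5, alpha=0.2)
print("selected test indices:", selected)
```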
- [6] arXiv:2411.18012 [pdf, html, other]
Title: Bayesian Inference of Spatially Varying Correlations via the Thresholded Correlation Gaussian Process
Subjects: Methodology (stat.ME)
A central question in multimodal neuroimaging analysis is to understand the association between two imaging modalities and to identify brain regions where such an association is statistically significant. In this article, we propose a Bayesian nonparametric spatially varying correlation model to make inference of such regions. We build our model based on the thresholded correlation Gaussian process (TCGP). It ensures piecewise smoothness, sparsity, and jump discontinuity of spatially varying correlations, and is well applicable even when the number of subjects is limited or the signal-to-noise ratio is low. We study the identifiability of our model, establish the large support property, and derive the posterior consistency and selection consistency. We also develop a highly efficient Gibbs sampler and its variant to compute the posterior distribution. We illustrate the method with both simulations and an analysis of functional magnetic resonance imaging data from the Human Connectome Project.
- [7] arXiv:2411.18334 [pdf, html, other]
Title: Multi-response linear regression estimation based on low-rank pre-smoothing
Subjects: Methodology (stat.ME)
Pre-smoothing is a technique aimed at increasing the signal-to-noise ratio in data to improve subsequent estimation and model selection in regression problems. However, pre-smoothing has thus far been limited to the univariate response regression setting. Motivated by the widespread interest in multi-response regression analysis in many scientific applications, this article proposes a technique for data pre-smoothing in this setting based on low-rank approximation. We establish theoretical results on the performance of the proposed methodology, and quantify its benefit empirically in a number of simulated experiments. We also demonstrate our proposed low-rank pre-smoothing technique on real data arising from the environmental and biological sciences.
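A minimal sketch of the general idea (my illustration, assuming the pre-smoother is a truncated SVD of the response matrix; the authors' exact construction and theory may differ): replace the multi-response matrix Y by its best rank-r approximation before fitting least squares.

```python
import numpy as np

def low_rank_presmooth(Y, rank):
    """Best rank-`rank` approximation of Y (n x q) via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def multiresponse_ols(X, Y):
    """Ordinary least squares coefficients for multi-response regression."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# toy example: q responses driven by a rank-2 signal plus noise
rng = np.random.default_rng(0)
n, p, q, r = 200, 5, 10, 2
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # rank-2 coefficient matrix
Y = X @ B + rng.normal(size=(n, q))

B_raw = multiresponse_ols(X, Y)
B_smooth = multiresponse_ols(X, low_rank_presmooth(Y, rank=r))
print("estimation error without pre-smoothing:", np.linalg.norm(B_raw - B))
print("estimation error with    pre-smoothing:", np.linalg.norm(B_smooth - B))
```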
- [8] arXiv:2411.18351 [pdf, html, other]
Title: On an EM-based closed-form solution for 2 parameter IRT models
Authors: Stefano Noventa (1), Roberto Faleh (1), Augustin Kelava (1) ((1) Methods Center, University of Tuebingen)
Comments: 30 pages, 6 figures, submitted to Psychometrika
Subjects: Methodology (stat.ME); Computation (stat.CO)
It is a well-known issue that in Item Response Theory models there is no closed form for the maximum likelihood estimators of the item parameters. Parameter estimation is therefore typically achieved by means of numerical methods such as gradient search. The present work has a two-fold aim: on the one hand, we review the fundamental notions associated with item parameter estimation in 2 parameter Item Response Theory models from the perspective of the complete-data likelihood. On the other hand, we argue that, within an Expectation-Maximization approach, a closed form for the discrimination and difficulty parameters can actually be obtained that simply corresponds to the Ordinary Least Squares solution.
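The link between the 2PL model and least squares can be seen in a toy calculation (a sketch of the intuition only, not the authors' EM derivation): writing the item response function as logit P(X=1 | theta) = a*theta + d with d = -a*b, exact log-odds on a grid of theta values are linear in theta, so an ordinary least squares fit recovers a and b.

```python
import numpy as np

a_true, b_true = 1.5, -0.3                      # discrimination, difficulty
theta = np.linspace(-3, 3, 21)                  # quadrature-style grid
p = 1.0 / (1.0 + np.exp(-a_true * (theta - b_true)))   # 2PL response probabilities

# OLS of logit(p) on theta: slope is a, intercept is d = -a*b
logit_p = np.log(p / (1 - p))
design = np.column_stack([np.ones_like(theta), theta])
d_hat, a_hat = np.linalg.lstsq(design, logit_p, rcond=None)[0]
b_hat = -d_hat / a_hat
print(f"a_hat = {a_hat:.3f} (true {a_true}), b_hat = {b_hat:.3f} (true {b_true})")
```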
- [9] arXiv:2411.18398 [pdf, html, other]
Title: Derivative Estimation of Multivariate Functional Data
Subjects: Methodology (stat.ME)
Existing approaches for derivative estimation are restricted to univariate functional data. We propose two methods to estimate the principal components and scores for the derivatives of multivariate functional data. As a result, the derivatives can be reconstructed by a multivariate Karhunen-Loève expansion. The first approach is an extended version of multivariate functional principal component analysis (MFPCA) which incorporates the derivatives, referred to as derivative MFPCA (DMFPCA). The second approach is based on the derivation of multivariate Karhunen-Loève (DMKL) expansion. We compare the performance of the two proposed methods with a direct approach in simulations. The simulation results indicate that DMFPCA outperforms DMKL and the direct approach, particularly for densely observed data. We apply DMFPCA and DMKL methods to coronary angiogram data to recover derivatives of diameter and quantitative flow ratio. We obtain the multivariate functional principal components and scores of the derivatives, which can be used to classify patterns of coronary artery disease.
- [10] arXiv:2411.18416 [pdf, html, other]
Title: Probabilistic size-and-shape functional mixed models
Comments: NeurIPS 2024
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
The reliable recovery and uncertainty quantification of a fixed effect function $\mu$ in a functional mixed model, for modelling population- and object-level variability in noisily observed functional data, is a notoriously challenging task: variations along the $x$ and $y$ axes are confounded with additive measurement error, and cannot in general be disentangled. The question then as to what properties of $\mu$ may be reliably recovered becomes important. We demonstrate that it is possible to recover the size-and-shape of a square-integrable $\mu$ under a Bayesian functional mixed model. The size-and-shape of $\mu$ is a geometric property invariant to a family of space-time unitary transformations, viewed as rotations of the Hilbert space, that jointly transform the $x$ and $y$ axes. A random object-level unitary transformation then captures size-and-shape \emph{preserving} deviations of $\mu$ from an individual function, while a random linear term and measurement error capture size-and-shape \emph{altering} deviations. The model is regularized by appropriate priors on the unitary transformations, posterior summaries of which may then be suitably interpreted as optimal data-driven rotations of a fixed orthonormal basis for the Hilbert space. Our numerical experiments demonstrate utility of the proposed model, and superiority over the current state-of-the-art.
- [11] arXiv:2411.18433 [pdf, html, other]
Title: A Latent Space Approach to Inferring Distance-Dependent Reciprocity in Directed Networks
Comments: 21 pages, 10 figures, 3 tables
Subjects: Methodology (stat.ME)
Reciprocity, or the stochastic tendency for actors to form mutual relationships, is an essential characteristic of directed network data. Existing latent space approaches to modeling directed networks are severely limited by the assumption that reciprocity is homogeneous across the network. In this work, we introduce a new latent space model for directed networks that can model heterogeneous reciprocity patterns that arise from the actors' latent distances. Furthermore, existing edge-independent latent space models are nested within the proposed model class, which allows for meaningful model comparisons. We introduce a Bayesian inference procedure to infer the model parameters using Hamiltonian Monte Carlo. Lastly, we use the proposed method to infer different reciprocity patterns in an advice network among lawyers, an information-sharing network between employees at a manufacturing company, and a friendship network between high school students.
- [12] arXiv:2411.18481 [pdf, other]
Title: Bhirkuti's Test of Bias Acceptance: Examining in Psychometric Simulations
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO); Other Statistics (stat.OT)
This study introduces Bhirkuti's Test of Bias Acceptance, a systematic graphical framework for evaluating bias and determining its acceptability under varying experimental conditions. Absolute Relative Bias (ARB), while useful for understanding bias, is sensitive to outliers and to the magnitude of the population parameter, often overstating bias for small values and understating it for larger ones. Similarly, Relative Efficiency (RE) can be influenced by variance differences and outliers, occasionally producing counterintuitive values exceeding 100%, which complicates interpretation. By addressing these limitations of the traditional ARB and RE metrics, the proposed graphical framework leverages ridgeline plots and standardized estimates to provide a comprehensive visualization of parameter estimate distributions. Ridgeline plots used in this way offer a robust alternative by visualizing full distributions, highlighting variability, trends, outliers, and descriptive features, and facilitating more informed decision-making. This study employs multivariate Latent Growth Models (LGM) and Monte Carlo simulations to examine the performance of growth curve modeling under planned missing data designs, focusing on parameter estimate recovery and efficiency. By combining innovative visualization techniques with rigorous simulation methods, Bhirkuti's Test of Bias Acceptance provides a versatile and interpretable toolset for advancing quantitative research in bias evaluation and efficiency assessment.
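The sensitivity of ARB to the magnitude of the population parameter is easy to see numerically (a small illustration of the issue described above, not code from the study): the same absolute error yields very different ARB values depending on the size of the true parameter.

```python
# Absolute Relative Bias: ARB = |estimate - truth| / |truth|
def arb(estimate, truth):
    return abs(estimate - truth) / abs(truth)

for truth in (0.02, 0.2, 2.0):
    est = truth + 0.01          # identical absolute error of 0.01 in each case
    print(f"truth={truth}: ARB = {arb(est, truth):.1%}")
# truth=0.02: ARB = 50.0%; truth=0.2: ARB = 5.0%; truth=2.0: ARB = 0.5%
```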
- [13] arXiv:2411.18510 [pdf, html, other]
Title: A subgroup-aware scoring approach to the study of effect modification in observational studies
Subjects: Methodology (stat.ME)
Effect modification means that the size of a treatment effect varies with an observed covariate. Generally speaking, a larger treatment effect with more stable error terms is less sensitive to bias, so we may be able to show that a study is less sensitive to unmeasured bias by focusing on subgroups experiencing larger treatment effects. Lee et al. (2018) proposed the submax method, which leverages the joint distribution of test statistics from subgroups to draw a firmer conclusion when effect modification occurs. One version of the submax method uses M-statistics as the test statistics and is implemented in the R package submax (Rosenbaum, 2017). The scaling factor in these M-statistics is computed using all observations combined across subgroups, and we show that this pooling can confuse effect modification with outliers. To tackle the issue, we propose a novel group M-statistic that scores the matched pairs within each subgroup. We examine the new scoring strategy in extensive settings to demonstrate its superior performance. The proposed method is applied to an observational study of the effect of a malaria prevention treatment in West Africa.
- [14] arXiv:2411.18549 [pdf, html, other]
Title: Finite population inference for skewness measures
Subjects: Methodology (stat.ME)
In this article we consider Bowley's skewness measure and the Groeneveld-Meeden $b_{3}$ index in the context of finite population sampling. We employ the functional delta method to obtain asymptotic variance formulae for plug-in estimators and propose corresponding variance estimators. We then consider plug-in estimators based on the Hájek cdf-estimator and on a Deville-Särndal type calibration estimator and test the performance of normal confidence intervals.
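For reference, Bowley's coefficient has the familiar quartile-based plug-in form below; this is a minimal unweighted version (the design-weighted Hájek and calibration estimators studied in the paper are not shown).

```python
import numpy as np

def bowley_skewness(x):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1), lies in [-1, 1]."""
    q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

rng = np.random.default_rng(0)
print(bowley_skewness(rng.normal(size=10_000)))       # ~0 for a symmetric distribution
print(bowley_skewness(rng.exponential(size=10_000)))  # > 0 for a right-skewed one
```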
New submissions (showing 14 of 14 entries)
- [15] arXiv:2411.16552 (cross-list from stat.AP) [pdf, html, other]
Title: When Is Heterogeneity Actionable for Personalization?
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
Targeting and personalization policies can be used to improve outcomes beyond the uniform policy that assigns the best performing treatment in an A/B test to everyone. Personalization relies on the presence of heterogeneity of treatment effects, yet, as we show in this paper, heterogeneity alone is not sufficient for personalization to be successful. We develop a statistical model to quantify "actionable heterogeneity," or the conditions when personalization is likely to outperform the best uniform policy. We show that actionable heterogeneity can be visualized as crossover interactions in outcomes across treatments and depends on three population-level parameters: within-treatment heterogeneity, cross-treatment correlation, and the variation in average responses. Our model can be used to predict the expected gain from personalization prior to running an experiment and also allows for sensitivity analysis, providing guidance on how changing treatments can affect the personalization gain. To validate our model, we apply five common personalization approaches to two large-scale field experiments with many interventions that encouraged flu vaccination. We find an 18% gain from personalization in one and a more modest 4% gain in the other, which is consistent with our model. Counterfactual analysis shows that this difference in the gains from personalization is driven by a drastic difference in within-treatment heterogeneity. However, reducing cross-treatment correlation holds a larger potential to further increase personalization gains. Our findings provide a framework for assessing the potential from personalization and offer practical recommendations for improving gains from targeting in multi-intervention settings.
- [16] arXiv:2411.17808 (cross-list from stat.CO) [pdf, html, other]
Title: spar: Sparse Projected Averaged Regression in R
Comments: 32 pages, 5 figures
Subjects: Computation (stat.CO); Methodology (stat.ME)
Package spar for R builds ensembles of predictive generalized linear models with high-dimensional predictors. It employs an algorithm utilizing variable screening and random projection tools to efficiently handle the computational challenges associated with large sets of predictors. The package is designed with a strong focus on extensibility. Screening and random projection techniques are implemented as S3 classes with user-friendly constructor functions, enabling users to easily integrate and develop new procedures. This design enhances the package's adaptability and makes it a powerful tool for a variety of high-dimensional applications.
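A language-agnostic sketch of the screen-then-project-then-average idea (written in Python purely for illustration; it does not reproduce the spar R API or the package's exact algorithm):

```python
import numpy as np

def spar_like_fit_predict(X, y, X_new, n_models=20, n_screen=100, proj_dim=10, seed=0):
    """Ensemble sketch: screen variables by |marginal correlation| with y, then for
    each ensemble member apply a Gaussian random projection to the screened block,
    fit least squares, and average the predictions."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
    idx = np.argsort(corr)[::-1][:n_screen]          # screened variables
    preds = np.zeros(X_new.shape[0])
    for _ in range(n_models):
        R = rng.normal(size=(n_screen, proj_dim)) / np.sqrt(proj_dim)
        beta = np.linalg.lstsq(X[:, idx] @ R, y, rcond=None)[0]
        preds += (X_new[:, idx] @ R) @ beta
    return preds / n_models

# toy high-dimensional example
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 500)); beta = np.zeros(500); beta[:5] = 2.0
y = X @ beta + rng.normal(size=150)
X_new = rng.normal(size=(20, 500))
print(spar_like_fit_predict(X, y, X_new)[:5])
```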
- [17] arXiv:2411.17989 (cross-list from cs.LG) [pdf, html, other]
Title: Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
As the significance of understanding the cause-and-effect relationships among variables increases in the development of modern systems and algorithms, learning causality from observational data has become a preferred and efficient alternative to conducting randomized controlled trials. However, purely observational data can be insufficient to reconstruct the true causal graph. Consequently, many researchers have tried to utilise some form of prior knowledge to improve the causal discovery process. In this context, the impressive capabilities of large language models (LLMs) have emerged as a promising alternative to the costly acquisition of prior expert knowledge. In this work, we further explore the potential of using LLMs to enhance causal discovery approaches, focusing in particular on score-based methods, and we propose a general framework that utilises the capacity of not just one but multiple LLMs to augment the discovery process.
- [18] arXiv:2411.18008 (cross-list from cs.LG) [pdf, html, other]
Title: Causal and Local Correlations Based Network for Multivariate Time Series Classification
Comments: Submitted on April 03, 2023; major revisions on March 25, 2024; minor revisions on July 9, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Recently, time series classification has attracted the attention of a large number of researchers, and hundreds of methods have been proposed. However, these methods often ignore the spatial correlations among dimensions and the local correlations among features. To address this issue, the causal and local correlations based network (CaLoNet) is proposed in this study for multivariate time series classification. First, pairwise spatial correlations between dimensions are modeled using causality modeling to obtain the graph structure. Then, a relationship extraction network is used to fuse local correlations to obtain long-term dependency features. Finally, the graph structure and long-term dependency features are integrated into the graph neural network. Experiments on the UEA datasets show that CaLoNet can obtain competitive performance compared with state-of-the-art methods.
- [19] arXiv:2411.18502 (cross-list from stat.ML) [pdf, other]
Title: Isometry pursuit
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)
Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.
- [20] arXiv:2411.18569 (cross-list from stat.ML) [pdf, html, other]
Title: A Flexible Defense Against the Winner's Curse
Subjects: Machine Learning (stat.ML); Statistics Theory (math.ST); Methodology (stat.ME)
Across science and policy, decision-makers often need to draw conclusions about the best candidate among competing alternatives. For instance, researchers may seek to infer the effectiveness of the most successful treatment or determine which demographic group benefits most from a specific treatment. Similarly, in machine learning, practitioners are often interested in the population performance of the model that performs best empirically. However, cherry-picking the best candidate leads to the winner's curse: the observed performance for the winner is biased upwards, rendering conclusions based on standard measures of uncertainty invalid. We introduce the zoom correction, a novel approach for valid inference on the winner. Our method is flexible: it can be employed in both parametric and nonparametric settings, can handle arbitrary dependencies between candidates, and automatically adapts to the level of selection bias. The method easily extends to important related problems, such as inference on the top k winners, inference on the value and identity of the population winner, and inference on "near-winners."
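The phenomenon itself is easy to reproduce in a few lines (a plain illustration of the winner's curse, not the zoom correction proposed in the paper): even when all candidates are equally good, the empirically best one systematically overstates its own population performance.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, reps = 20, 100, 2000
true_means = np.zeros(K)            # all candidates are equally good
gap = np.empty(reps)
for r in range(reps):
    sample_means = rng.normal(true_means, 1 / np.sqrt(n))    # each estimated from n obs
    winner = np.argmax(sample_means)
    gap[r] = sample_means[winner] - true_means[winner]       # optimism of the winner
print(f"average upward bias of the winner: {gap.mean():.3f} "
      f"(vs. ~0 for any fixed candidate)")
```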
Cross submissions (showing 6 of 6 entries)
- [21] arXiv:2307.16353 (replaced) [pdf, html, other]
Title: Single Proxy Synthetic Control
Subjects: Methodology (stat.ME)
Synthetic control methods are widely used to estimate the treatment effect on a single treated unit in time-series settings. A common approach to estimate synthetic control weights is to regress the treated unit's pre-treatment outcome and covariates' time series measurements on those of untreated units via ordinary least squares. However, this approach can perform poorly if the pre-treatment fit is not near perfect, whether the weights are normalized or not. In this paper, we introduce a single proxy synthetic control approach, which views the outcomes of untreated units as proxies of the treatment-free potential outcome of the treated unit, a perspective we leverage to construct a valid synthetic control. Under this framework, we establish an alternative identification strategy and corresponding estimation methods for synthetic controls and the treatment effect on the treated unit. Notably, unlike existing proximal synthetic control methods, which require two types of proxies for identification, ours relies on a single type of proxy, thus facilitating its practical relevance. Additionally, we adapt a conformal inference approach to perform inference about the treatment effect, obviating the need for a large number of post-treatment observations. Lastly, our framework can accommodate time-varying covariates and nonlinear models. We demonstrate the proposed approach in a simulation study and a real-world application.
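The baseline described above, regressing the treated unit's pre-treatment outcomes on those of the untreated units via ordinary least squares, looks roughly as follows (a schematic of that standard approach for orientation, not of the single proxy method introduced in the paper):

```python
import numpy as np

def ols_synthetic_control(Y0_pre, y1_pre, Y0_post, intercept=True):
    """OLS synthetic control: regress the treated unit's pre-treatment outcomes
    (y1_pre, length T0) on the untreated units' outcomes (Y0_pre, T0 x J), then
    project the post-treatment donor outcomes to form the counterfactual."""
    X = np.column_stack([np.ones(len(y1_pre)), Y0_pre]) if intercept else Y0_pre
    w = np.linalg.lstsq(X, y1_pre, rcond=None)[0]
    Xp = np.column_stack([np.ones(Y0_post.shape[0]), Y0_post]) if intercept else Y0_post
    return Xp @ w            # synthetic (treatment-free) outcome for post periods

# toy example: estimated effect = observed post outcome minus synthetic control
rng = np.random.default_rng(0)
T0, T1, J = 30, 10, 8
Y0 = rng.normal(size=(T0 + T1, J))
y1 = Y0 @ rng.dirichlet(np.ones(J)) + rng.normal(scale=0.1, size=T0 + T1)
y1[T0:] += 1.0                                   # treatment effect of 1 after T0
synth = ols_synthetic_control(Y0[:T0], y1[:T0], Y0[T0:])
print("estimated effects:", np.round(y1[T0:] - synth, 2))
```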
- [22] arXiv:2405.13799 (replaced) [pdf, html, other]
Title: Extending Kernel Testing To General Designs
Comments: 10 pages, 3 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Kernel-based testing has revolutionized the field of non-parametric tests through the embedding of distributions in an RKHS. This strategy has proven to be powerful and flexible, yet its applicability has been limited to the standard two-sample case, while practical situations often involve more complex experimental designs. To extend kernel testing to any design, we propose a linear model in the RKHS that allows for the decomposition of mean embeddings into additive functional effects. We then introduce a truncated kernel Hotelling-Lawley statistic to test the effects of the model, demonstrating that its asymptotic distribution is chi-square, which remains valid with its Nystrom approximation. We discuss a homoscedasticity assumption that, although absent in the standard two-sample case, is necessary for general designs. Finally, we illustrate our framework using a single-cell RNA sequencing dataset and provide kernel-based generalizations of classical diagnostic and exploration tools to broaden the scope of kernel testing in any experimental design.
- [23] arXiv:2409.11167 (replaced) [pdf, html, other]
Title: Poisson and Gamma Model Marginalisation and Marginal Likelihood calculation using Moment-generating Functions
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We present a new analytical method to derive the likelihood function that has the population of parameters marginalised out in Bayesian hierarchical models. This method is also useful to find the marginal likelihoods in Bayesian models or in random-effect linear mixed models. The key to this method is to take high-order (sometimes fractional) derivatives of the prior moment-generating function if particular existence and differentiability conditions hold.
In particular, this analytical method assumes that the likelihood is either Poisson or gamma. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.
We also present some examples validating this new analytical method.
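As a concrete instance of the Poisson case (a standard worked example consistent with the abstract, not excerpted from the paper): the observed count sets the order of the derivative, and a gamma prior recovers the familiar negative binomial marginal.

```latex
\[
P(Y=y) = \int_0^\infty \frac{e^{-\lambda}\lambda^{y}}{y!}\,\pi(\lambda)\,d\lambda
       = \frac{1}{y!}\,\frac{d^{y}}{dt^{y}}\,M(t)\Big|_{t=-1},
\qquad M(t)=\mathbb{E}\big[e^{t\lambda}\big].
\]
% With lambda ~ Gamma(alpha, beta) (rate beta), M(t) = (beta/(beta - t))^alpha, hence
\[
P(Y=y) = \binom{\alpha+y-1}{y}
         \left(\frac{\beta}{\beta+1}\right)^{\!\alpha}
         \left(\frac{1}{\beta+1}\right)^{\!y}.
\]
```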
- [24] arXiv:2410.19019 (replaced) [pdf, other]
Title: Median Based Unit Weibull (MBUW): a new unit distribution Properties
Comments: arXiv admin note: text overlap with arXiv:2410.04132; this is a new update containing real data analysis (6 data sets) to illustrate the benefit of the distribution
Subjects: Methodology (stat.ME); Probability (math.PR)
A new two-parameter unit Weibull distribution is defined on the unit interval (0,1). The methodology for deriving its PDF, some of its properties, and related functions are discussed. The paper is supplemented with many figures illustrating the new distribution and showing how this makes it well suited to fitting a wide range of skewed data. The new distribution is also given the nickname (Attia).
- [25] arXiv:2411.15691 (replaced) [pdf, other]
Title: Data integration using covariate summaries from external sources
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In modern data analysis, information is frequently collected from multiple sources, often leading to challenges such as data heterogeneity and imbalanced sample sizes across datasets. Robust and efficient data integration methods are crucial for improving the generalization and transportability of statistical findings. In this work, we address scenarios where, in addition to having full access to individualized data from a primary source, supplementary covariate information from external sources is also available. While traditional data integration methods typically require individualized covariates from external sources, such requirements can be impractical due to limitations related to accessibility, privacy, storage, and cost. Instead, we propose novel data integration techniques that rely solely on external summary statistics, such as sample means and covariances, to construct robust estimators for the mean outcome under both homogeneous and heterogeneous data settings. Additionally, we extend this framework to causal inference, enabling the estimation of average treatment effects for both generalizability and transportability.
- [26] arXiv:2411.16831 (replaced) [pdf, html, other]
Title: Measuring Statistical Evidence: A Short Report
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
This short text attempts to establish a big picture of what evidential statistics is about and how an ideal inference method should behave. Moreover, by examining shortcomings of some of the currently used methods for measuring evidence and appealing to some intuitive principles, we motivate the Relative Belief Ratio as the primary method of characterizing statistical evidence. A number of topics have been omitted in the interest of brevity, and the reader is strongly advised to refer to (Evans, 2015) as the primary source for further reading on the subject.
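For orientation, the relative belief ratio of a parameter value is the ratio of its posterior to its prior density, with values above one indicating evidence in favour (Evans, 2015). A minimal beta-binomial sketch of the computation:

```python
from scipy.stats import beta

def relative_belief_ratio(theta0, a, b, successes, n):
    """RB(theta0) = posterior density / prior density for a Beta(a, b) prior
    and a binomial likelihood with `successes` out of `n` trials."""
    prior = beta(a, b)
    posterior = beta(a + successes, b + n - successes)
    return posterior.pdf(theta0) / prior.pdf(theta0)

# 62 successes in 100 trials: evidence for theta = 0.6, against theta = 0.4
print(relative_belief_ratio(0.6, 1, 1, 62, 100))   # > 1
print(relative_belief_ratio(0.4, 1, 1, 62, 100))   # < 1
```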
- [27] arXiv:2411.17033 (replaced) [pdf, html, other]
Title: Quantile Graph Discovery through QuACC: Quantile Association via Conditional Concordance
Subjects: Methodology (stat.ME)
Graphical structure learning is an effective way to assess and visualize cross-biomarker dependencies in biomedical settings. Standard approaches to estimating graphs rely on conditional independence tests that may not be sensitive to associations that manifest at the tails of joint distributions, i.e., they may miss connections among variables that exhibit associations mainly at lower or upper quantiles. In this work, we propose a novel measure of quantile-specific conditional association called QuACC: Quantile Association via Conditional Concordance. For a pair of variables and a conditioning set, QuACC quantifies agreement between the residuals from two quantile regression models, which may be linear or more complex, e.g., quantile forests. Using this measure as the basis for a test of null (quantile) association, we introduce a new class of quantile-specific graphical models. Through simulation we show that our method is powerful for detecting dependencies that manifest at the tails of distributions. We apply our method to biobank data from All of Us and identify quantile-specific patterns of conditional association in a multivariate setting.
- [28] arXiv:2211.13612 (replaced) [pdf, html, other]
Title: Joint modeling of wind speed and wind direction through a conditional approach
Comments: 29 pages, 15 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)
Atmospheric near-surface wind speed and wind direction play an important role in many applications, ranging from air quality modeling, building design, and wind turbine placement to climate change research. It is therefore crucial to accurately estimate the joint probability distribution of wind speed and direction. In this work we develop a conditional approach to model these two variables, where the joint distribution is decomposed into the product of the marginal distribution of wind direction and the conditional distribution of wind speed given wind direction. To accommodate the circular nature of wind direction, a von Mises mixture model is used; the conditional wind speed distribution is modeled as a direction-dependent Weibull distribution via a two-stage estimation procedure, consisting of directionally binned Weibull parameter estimation followed by a harmonic regression to estimate the dependence of the Weibull parameters on wind direction. A Monte Carlo simulation study indicates that our method outperforms an alternative method that uses periodic spline quantile regression in terms of estimation efficiency. We illustrate our method by using the output from a regional climate model to investigate how the joint distribution of wind speed and direction may change under some future climate scenarios.
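A rough sketch of the two-stage estimation described above (my illustration under simplifying assumptions: fixed direction bins, per-bin maximum-likelihood Weibull fits, and a first-order harmonic regression of the log-parameters on direction; the paper's implementation may differ in detail):

```python
import numpy as np
from scipy import stats

def fit_directional_weibull(speed, direction_rad, n_bins=12, n_harmonics=1):
    """Stage 1: Weibull (shape, scale) per direction bin. Stage 2: harmonic
    regression of the log-parameters on direction. Returns a predictor of
    (shape, scale) as a function of wind direction."""
    edges = np.linspace(0, 2 * np.pi, n_bins + 1)
    centers, log_params = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (direction_rad >= lo) & (direction_rad < hi)
        if sel.sum() < 30:
            continue                                   # skip sparse sectors
        shape, _, scale = stats.weibull_min.fit(speed[sel], floc=0)
        centers.append((lo + hi) / 2)
        log_params.append([np.log(shape), np.log(scale)])
    centers, log_params = np.array(centers), np.array(log_params)

    def design(d):                                     # [1, cos(kd), sin(kd)], k = 1..K
        cols = [np.ones_like(d)]
        for k in range(1, n_harmonics + 1):
            cols += [np.cos(k * d), np.sin(k * d)]
        return np.column_stack(cols)

    coefs = np.linalg.lstsq(design(centers), log_params, rcond=None)[0]
    return lambda d: np.exp(design(np.atleast_1d(d)) @ coefs)

# toy data: Weibull scale varies smoothly with direction
rng = np.random.default_rng(0)
d = rng.uniform(0, 2 * np.pi, 5000)
v = stats.weibull_min.rvs(2.0, scale=6 + 2 * np.cos(d), random_state=0)
predict = fit_directional_weibull(v, d)
print(predict([0.0, np.pi]))   # columns (shape, scale): scale ~8 near 0, ~4 near pi
```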
- [29] arXiv:2403.07657 (replaced) [pdf, other]
Title: Scalable Spatiotemporal Prediction with Bayesian Neural Fields
Comments: 29 pages, 7 figures, 2 tables, 1 listing
Journal-ref: Nature Communications 15(7942), 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in diverse applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As the scale of modern datasets increases, there is a growing need for statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle many observations. This article introduces the Bayesian Neural Field (BayesNF), a domain-general statistical model that infers rich spatiotemporal probability distributions for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust predictive uncertainty quantification. Evaluations against prominent baselines show that BayesNF delivers improvements on prediction problems from climate and public health data containing tens to hundreds of thousands of measurements. Accompanying the paper is an open-source software package (this https URL) that runs on GPU and TPU accelerators through the JAX machine learning platform.
- [30] arXiv:2410.06163 (replaced) [pdf, html, other]
Title: Markov Equivalence and Consistency in Differentiable Structure Learning
Comments: 38 pages, 14 figures, to appear at NeurIPS 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identify the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model. Assuming faithfulness, it also recovers the Markov equivalence class. These results are then generalized to general models and likelihoods, where the same claims hold. These theoretical results are validated empirically, showing how this can be done using standard gradient-based optimizers, thus paving the way for differentiable structure learning under general models and losses.
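For background, the differentiable acyclicity constraint used in this line of work is typically the NOTEARS-style trace-exponential penalty sketched below (well-known prior work; the likelihood regularization analyzed in the paper is a separate ingredient).

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """h(W) = tr(exp(W * W)) - d, where * is elementwise; h(W) = 0 iff W is acyclic."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_acyclic = np.array([[0.0, 1.5, 0.0],
                      [0.0, 0.0, 2.0],
                      [0.0, 0.0, 0.0]])
W_cyclic = W_acyclic.copy()
W_cyclic[2, 0] = 0.7            # add edge 3 -> 1, creating a cycle
print(notears_acyclicity(W_acyclic))   # ~0
print(notears_acyclicity(W_cyclic))    # > 0
```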