Performance
Showing new listings for Wednesday, 22 January 2025
- [1] arXiv:2501.12037 [pdf, html, other]
Title: A Stochastic Geometry Based Techno-Economic Analysis of RIS-Assisted Cellular Networks
Comments: This document supports a work submitted to WiOpt2025, including the supplementary mathematical verification in the appendix
Subjects: Performance (cs.PF); Networking and Internet Architecture (cs.NI)
Reconfigurable intelligent surfaces (RISs) are a promising technology for enhancing cellular network performance and yielding additional value to network operators. This paper proposes a techno-economic analysis of RIS-assisted cellular networks to guide operators in deciding between deploying additional RISs or base stations (BSs). We assume a relative cost model that considers the total cost of ownership (TCO) of deploying additional nodes, either BSs or RISs. We assume a return on investment (RoI) that is proportional to the system's spectral efficiency. The latter is evaluated based on a stochastic geometry model that gives an integral formula for the ergodic rate in cellular networks equipped with RISs. The marginal RoI for any investment strategy is determined by the partial derivative of this integral expression with respect to node densities. We investigate two case studies: throughput enhancement and coverage hole mitigation. These examples demonstrate how operators could determine the optimal investment strategy in scenarios defined by the current densities of BSs and RISs, and their relative costs. Numerical results illustrate the evolution of ergodic rates based on the proposed investment strategy, demonstrating the investment decision-making process while considering technological and economic factors. This work quantitatively demonstrates that strategically investing in RISs can offer better system-level benefits than solely investing in BS densification.
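The decision rule at the core of this analysis, comparing the marginal RoI of densifying BSs versus RISs normalized by their respective costs, can be sketched numerically. The snippet below is a minimal illustration only: `ergodic_rate` is a toy placeholder standing in for the paper's stochastic-geometry integral, and all densities and costs are made-up numbers.

```python
# Hypothetical sketch of the marginal-RoI decision rule described above.
# The ergodic-rate integral from the paper is NOT reproduced here; `ergodic_rate`
# is a toy placeholder to be replaced by the paper's formula.
import numpy as np

def ergodic_rate(lam_bs, lam_ris):
    """Toy proxy for spectral efficiency as a function of BS/RIS densities
    (placeholder only; substitute the paper's stochastic-geometry integral)."""
    return np.log2(1.0 + 50.0 * lam_bs + 8.0 * lam_ris)

def marginal_roi_per_cost(lam_bs, lam_ris, cost_bs, cost_ris, eps=1e-9):
    """Numerical partial derivatives of the rate w.r.t. each node density,
    normalized by the total cost of ownership of one extra node."""
    base = ergodic_rate(lam_bs, lam_ris)
    d_bs = (ergodic_rate(lam_bs + eps, lam_ris) - base) / eps
    d_ris = (ergodic_rate(lam_bs, lam_ris + eps) - base) / eps
    return d_bs / cost_bs, d_ris / cost_ris

# Example with illustrative numbers: a RIS assumed ~10x cheaper than a macro BS.
bs_gain, ris_gain = marginal_roi_per_cost(lam_bs=5e-6, lam_ris=2e-5,
                                           cost_bs=100_000.0, cost_ris=10_000.0)
print("invest in", "RIS" if ris_gain > bs_gain else "BS", "densification")
```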
New submissions (showing 1 of 1 entries)
- [2] arXiv:2501.11006 (cross-list from cs.DC) [pdf, html, other]
Title: GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation
Comments: Under submission in ACM/IEEE conference, 11 pages
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE)
Large Language Models (LLMs) are becoming integral to daily life, showcasing their vast potential across various Natural Language Processing (NLP) tasks. Beyond NLP, LLMs are increasingly used in software development tasks, such as code completion, modification, bug fixing, and code translation. Software engineers widely use tools like GitHub Copilot and Amazon Q to streamline workflows and automate tasks with high accuracy. While the resource and energy intensity of LLM training is often highlighted, inference can be even more resource-intensive over time, as it is a continuous process with a high number of invocations. Therefore, developing resource-efficient alternatives for LLM inference is crucial for sustainability. This work proposes GREEN-CODE, a framework for energy-aware code generation in LLMs. GREEN-CODE performs dynamic early exit during LLM inference. We train a Reinforcement Learning (RL) agent that learns to balance the trade-offs between accuracy, latency, and energy consumption. Our approach is evaluated on two open-source LLMs, Llama 3.2 3B and OPT 2.7B, using the JavaCorpus and PY150 datasets. Results show that our method reduces energy consumption by 23-50% on average for code generation tasks without significantly affecting accuracy.
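A minimal sketch of the early-exit mechanism this abstract describes, under heavy simplification: a fixed entropy threshold stands in for the learned RL policy, and the "model" is a stack of random toy layers rather than Llama 3.2 or OPT.

```python
# Confidence-based early exit during autoregressive decoding (illustrative only).
# GREEN-CODE learns the exit point with an RL agent; here a fixed entropy
# threshold plays that role, and the tiny random "model" is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D, VOCAB = 12, 64, 100
layers = [rng.standard_normal((D, D)) * 0.05 for _ in range(N_LAYERS)]
lm_head = rng.standard_normal((D, VOCAB)) * 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_token(h, entropy_threshold=4.0):
    """Run layers one by one; exit as soon as the output distribution is
    confident enough, skipping the remaining (energy-hungry) layers."""
    for depth, w in enumerate(layers, start=1):
        h = np.tanh(h @ w)                      # toy transformer block
        probs = softmax(h @ lm_head)            # early readout at this depth
        entropy = -(probs * np.log(probs + 1e-9)).sum()
        if entropy < entropy_threshold:         # confident enough -> exit early
            return int(probs.argmax()), depth
    return int(probs.argmax()), N_LAYERS        # fell through: full depth

token, exit_depth = decode_token(rng.standard_normal(D))
print(f"emitted token {token} after {exit_depth}/{N_LAYERS} layers")
```

Lowering the threshold trades energy savings for accuracy, which is exactly the trade-off the RL agent is trained to navigate per token.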
- [3] arXiv:2501.11558 (cross-list from cs.CR) [pdf, html, other]
Title: A performance analysis of VM-based Trusted Execution Environments for Confidential Federated Learning
Subjects: Cryptography and Security (cs.CR); Performance (cs.PF)
Federated Learning (FL) is a distributed machine learning approach that has emerged as an effective way to address recent privacy concerns. However, FL introduces the need for additional security measures, as FL alone is still subject to vulnerabilities such as model and data poisoning and inference attacks. Confidential Computing (CC) is a paradigm that, by leveraging hardware-based trusted execution environments (TEEs), protects the confidentiality and integrity of ML models and data, making it a powerful ally for FL applications. Typical TEEs offer application-level isolation but suffer from drawbacks such as limited available memory and debugging and coding difficulties. The new generation of TEEs offers virtual machine (VM)-based isolation, thus reducing the porting effort for existing applications. In this work, we compare the performance of VM-based and application-level TEEs for confidential FL (CFL) applications. In particular, we evaluate the impact of TEEs and of additional security mechanisms such as TLS (for securing the communication channel). The results, obtained across three datasets and two deep learning models, demonstrate that the new VM-based TEEs introduce a limited overhead (at most 1.5x), thus paving the way to leveraging public and untrusted computing environments, such as HPC facilities or the public cloud, without detriment to performance.
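The kind of overhead comparison reported here can be approximated with a simple timing harness: run the same local training round on a plain VM and inside the confidential environment, then take the ratio of wall-clock times. The workload below is a synthetic stand-in (logistic regression), not the paper's datasets, models, or FL framework.

```python
# Back-of-the-envelope harness for a TEE-overhead comparison (illustrative).
# Run this same script on a plain VM, inside an enclave runtime, and inside a
# confidential VM (e.g. SEV-SNP/TDX), then divide the measured times.
import time
import numpy as np

def local_training_round(n_samples=20_000, dim=256, epochs=3):
    """Stand-in for a client's local FL update: logistic regression on
    synthetic data. Replace with the real model/dataset."""
    rng = np.random.default_rng(1)
    X = rng.standard_normal((n_samples, dim))
    y = (X[:, 0] > 0).astype(float)
    w = np.zeros(dim)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= 0.1 * X.T @ (p - y) / n_samples   # gradient step
    return w

start = time.perf_counter()
local_training_round()
elapsed = time.perf_counter() - start
print(f"local round took {elapsed:.2f} s in this environment")
# slowdown = elapsed_in_cvm / elapsed_on_plain_vm   # the paper reports <= 1.5x
```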
- [4] arXiv:2501.11779 (cross-list from cs.LG) [pdf, html, other]
Title: Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Large Language Models (LLMs) have revolutionized natural language processing, but their inference demands substantial resources while under-utilizing high-end accelerators like GPUs. A major bottleneck arises from the attention mechanism, which requires storing large key-value caches, limiting the maximum achievable throughput to far below the available computing resources. Current approaches attempt to mitigate this issue through memory-efficient attention and paging mechanisms, but remain constrained by the assumption that all operations must be performed on high-end accelerators.
In this work, we propose Glinthawk, a two-tiered architecture that decouples the attention mechanism from the rest of the Transformer model. This approach allows the memory requirements for attention to scale independently, enabling larger batch sizes and more efficient use of the high-end accelerators. We prototype Glinthawk with NVIDIA T4 GPUs as one tier and standard CPU VMs as the other. Compared to a traditional single-tier setup, it improves throughput by $5.9\times$ and reduces the cost of generation by $2.8\times$. For longer sequence lengths, it achieves a $16.3\times$ throughput improvement at $2.4\times$ lower cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-oriented applications such as batch processing. We share our prototype publicly at \url{this https URL}.
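The split described above can be illustrated with a toy decode loop in which one object plays the accelerator tier (projections and MLP) and another plays the memory-rich tier that owns the growing KV cache. The real system distributes these roles across machines and batches requests, which this sketch does not attempt.

```python
# Toy illustration of a two-tier split: compute-dense projections/MLP on one
# tier, attention + KV cache on a memory-rich tier. Tiers are plain Python
# objects here; in a real deployment they would exchange tensors over a network.
import numpy as np

D = 64
rng = np.random.default_rng(0)

class AttentionTier:                       # e.g. CPU VMs: large, cheap memory
    def __init__(self):
        self.k_cache, self.v_cache = [], []
    def attend(self, q, k, v):
        self.k_cache.append(k); self.v_cache.append(v)
        K, V = np.stack(self.k_cache), np.stack(self.v_cache)
        scores = K @ q / np.sqrt(D)
        w = np.exp(scores - scores.max()); w /= w.sum()
        return w @ V

class ComputeTier:                         # e.g. GPU: dense matmul-heavy work
    def __init__(self):
        self.wq, self.wk, self.wv, self.wo = (rng.standard_normal((D, D)) * 0.05
                                              for _ in range(4))
    def pre_attention(self, h):
        return h @ self.wq, h @ self.wk, h @ self.wv
    def post_attention(self, h, attn_out):
        return h + np.tanh(h + attn_out @ self.wo)   # residual + toy MLP

gpu, cpu = ComputeTier(), AttentionTier()
h = rng.standard_normal(D)
for _ in range(8):                          # decode a few tokens
    q, k, v = gpu.pre_attention(h)          # tier 1: projections
    attn = cpu.attend(q, k, v)              # tier 2: owns the growing KV cache
    h = gpu.post_attention(h, attn)         # back on tier 1
print("hidden state norm after 8 steps:", round(float(np.linalg.norm(h)), 3))
```

Because the KV cache lives entirely on the second tier, the accelerator's memory no longer limits batch size, which is the intuition behind the throughput gains quoted above.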
- [5] arXiv:2501.12084 (cross-list from cs.DC) [pdf, html, other]
Title: Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis
Comments: arXiv admin note: substantial text overlap with arXiv:2402.13499
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Performance (cs.PF)
Modern GPUs, with their specialized hardware like tensor cores, are essential for demanding AI and deep learning applications. This study presents a comprehensive, multi-level microbenchmarking analysis of the NVIDIA Hopper GPU architecture, delving into its performance characteristics and novel features. We benchmark Hopper's memory subsystem latency and throughput, comparing its L2 partitioned cache behavior and global memory access patterns against recent GPU generations, Ampere and Ada Lovelace. Our analysis reveals significant performance differences and architectural improvements in Hopper. A core contribution of this work is a detailed evaluation of Hopper's fourth-generation tensor cores, including their FP8 precision support and the novel asynchronous wgmma instructions, assessing their impact on matrix multiply-accumulate operations. We further investigate the performance implications of other key Hopper innovations: DPX instructions for accelerating dynamic programming algorithms, distributed shared memory (DSM) for inter-SM communication, and the Tensor Memory Accelerator (TMA) for asynchronous data movement. This multi-level approach encompasses instruction-level microbenchmarks, library-level analysis of the Transformer Engine, and application-level benchmarks of tensor core performance within large language models. Our findings provide valuable, in-depth insights for software developers seeking to optimize performance and develop accurate performance models for the Hopper architecture, ultimately contributing to a deeper understanding of its potential for accelerating AI and other computationally intensive workloads.
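For a first-order feel of the library- and application-level side of such measurements, the sketch below times large half-precision GEMMs through PyTorch/cuBLAS (which dispatch to tensor-core MMA/wgmma paths on Hopper) and reports achieved TFLOP/s. It is far coarser than the paper's instruction-level microbenchmarks and assumes PyTorch with a CUDA-capable GPU.

```python
# Coarse GEMM throughput probe, not the paper's methodology.
import torch

def gemm_tflops(n=8192, dtype=torch.float16, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.matmul(a, b)                       # warm-up / kernel selection
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3  # elapsed_time() returns milliseconds
    return 2 * n**3 * iters / seconds / 1e12 # 2*N^3 FLOPs per square GEMM

for dt in (torch.float16, torch.bfloat16, torch.float32):
    print(dt, f"{gemm_tflops(dtype=dt):.1f} TFLOP/s")
```

Comparing the half-precision numbers against FP32 gives a rough sense of the tensor-core speedup; per-instruction latency, DPX, DSM, and TMA behavior require the dedicated microbenchmarks the paper describes.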
Cross submissions (showing 4 of 4 entries)
- [6] arXiv:2405.14209 (replaced) [pdf, html, other]
Title: Exploring and Evaluating Real-world CXL: Use Cases and System Adoption
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR)
Compute eXpress Link (CXL) is emerging as a promising memory interface technology. Because CXL devices are not yet widely available, the performance of CXL memory is largely unknown. What are the use cases for CXL memory? What are its impacts on application performance? How can it be used in combination with existing memory components? In this work, we study the performance of three genuine CXL memory-expansion cards from different vendors. We characterize the basic performance of CXL memory, study how HPC applications and large language models can benefit from it, and study the interplay between memory tiering and page interleaving. We also propose a novel data object-level interleaving policy to match interleaving decisions to memory access patterns. We reveal the challenges and opportunities of using CXL memory.
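An object-level interleaving policy of this kind ultimately reduces to a placement decision per data object based on its profiled access pattern. The sketch below shows one plausible decision rule with invented thresholds and object names; the paper's actual policy, profiling inputs, and placement mechanism (in practice something like numactl/memkind/mbind) may differ.

```python
# Illustrative object-level placement logic: bandwidth-bound streaming objects
# get interleaved across local DRAM and CXL memory, latency-sensitive hot
# objects stay in local DRAM, cold objects overflow to CXL-only capacity.
from dataclasses import dataclass

@dataclass
class DataObject:
    name: str
    size_gb: float
    accesses_per_byte: float    # reuse intensity (from profiling)
    sequential_fraction: float  # 1.0 = pure streaming, 0.0 = pointer chasing

def placement(obj, dram_free_gb):
    if obj.accesses_per_byte > 10 and obj.size_gb <= dram_free_gb:
        return "local DRAM"               # hot / latency-sensitive and it fits
    if obj.sequential_fraction > 0.7 and obj.accesses_per_byte > 0.1:
        return "interleave DRAM+CXL"      # bandwidth-bound: spread pages
    return "CXL only"                     # cold capacity overflow

objects = [                               # hypothetical profiled objects
    DataObject("embedding_table", 96.0, 0.3, 0.9),
    DataObject("hash_index",       8.0, 40.0, 0.1),
    DataObject("checkpoint_buf",  64.0, 0.01, 1.0),
]
for o in objects:
    print(f"{o.name:16s} -> {placement(o, dram_free_gb=32)}")
```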
- [7] arXiv:2412.05270 (replaced) [pdf, other]
Title: APOLLO: SGD-like Memory, AdamW-level Performance
Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee
Comments: Preprint; update code link and visualization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.
In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs.
Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW, by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
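The following is a toy, single-matrix reading of the structured learning-rate idea: Adam-style moments are kept only for a low-rank random projection of the gradient, and the resulting per-channel scaling is applied to the raw gradient. The rank, hyperparameters, and exact scaling rule here are illustrative assumptions, not the released APOLLO implementation.

```python
# Toy sketch of a structured learning-rate update with compressed optimizer
# state (an APOLLO-like idea, not the official algorithm or code).
import numpy as np

rng = np.random.default_rng(0)
n, m, rank = 512, 256, 8
W = rng.standard_normal((n, m)) * 0.01
P = rng.standard_normal((m, rank)) / np.sqrt(rank)   # fixed random projection
M = np.zeros((n, rank)); V = np.zeros((n, rank))     # moments in compressed space
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def apollo_like_step(W, grad, step):
    global M, V
    R = grad @ P                                     # project gradient: (n, rank)
    M = beta1 * M + (1 - beta1) * R
    V = beta2 * V + (1 - beta2) * R**2
    m_hat = M / (1 - beta1**step)
    v_hat = V / (1 - beta2**step)
    update_c = m_hat / (np.sqrt(v_hat) + eps)        # Adam-style update, compressed
    # Per-row (channel) scaling: how strongly Adam would rescale this channel.
    scale = (np.linalg.norm(update_c, axis=1) /
             (np.linalg.norm(R, axis=1) + eps))[:, None]
    return W - lr * scale * grad                     # scaled raw gradient step

for step in range(1, 4):                             # dummy gradients for the demo
    W = apollo_like_step(W, rng.standard_normal((n, m)), step)
print("updated weight norm:", round(float(np.linalg.norm(W)), 3))
```

The memory point is visible in the shapes: the optimizer state is (n, rank) x 2 instead of (n, m) x 2, which is where the claimed SGD-like footprint comes from.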