Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

November 7, 2025

arXiv:2511.04002v1 Announce Type: cross
Abstract: Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by strategically partitioning models into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while dramatically reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects optimal split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49× inference speedup and significant communication overhead reduction while maintaining or improving model accuracy.
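
The abstract names the two-stage intermediate compression pipeline but gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' TS or TAB-Q. It assumes that threshold splitting means keeping large-magnitude activations at full precision, and that token-wise adaptive bit selection means choosing, per token, the smallest bit-width that meets an error tolerance. All function names, the threshold, the candidate bit-widths, and the tolerance are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def threshold_split(token: List[float], threshold: float) -> Tuple[Dict[int, float], List[float]]:
    """Separate large-magnitude entries (kept exact) from the dense remainder."""
    outliers = {i: v for i, v in enumerate(token) if abs(v) > threshold}
    dense = [0.0 if i in outliers else v for i, v in enumerate(token)]
    return outliers, dense


def quantize(dense: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization of one token with a per-token scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in dense), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) for v in dense], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def pick_bits(dense: List[float], candidates: Sequence[int], tolerance: float) -> int:
    """Smallest candidate bit-width whose max absolute error is within tolerance.

    Falls back to the largest candidate if none qualifies (an assumed policy).
    """
    for bits in sorted(candidates):
        codes, scale = quantize(dense, bits)
        restored = dequantize(codes, scale)
        err = max((abs(a - b) for a, b in zip(dense, restored)), default=0.0)
        if err <= tolerance:
            return bits
    return max(candidates)


def compress_token(token: List[float], threshold: float = 4.0,
                   candidates: Sequence[int] = (2, 4, 8),
                   tolerance: float = 0.05) -> dict:
    """Split off outliers, then quantize the remainder at an adaptive bit-width."""
    outliers, dense = threshold_split(token, threshold)
    bits = pick_bits(dense, candidates, tolerance)
    codes, scale = quantize(dense, bits)
    return {"outliers": outliers, "bits": bits, "scale": scale, "codes": codes}


def decompress_token(packet: dict) -> List[float]:
    """Rebuild the token: dequantize the dense part, then restore exact outliers."""
    values = dequantize(packet["codes"], packet["scale"])
    for i, v in packet["outliers"].items():
        values[i] = v
    return values


if __name__ == "__main__":
    token = [0.1, -0.2, 9.5, 0.05, 0.3]  # made-up activations with one outlier
    packet = compress_token(token)
    print(packet["bits"], packet["outliers"], decompress_token(packet))
```

In a split-computing setting, the edge device would send a packet like this for each token and the server would call something like `decompress_token` before running the back-end layers. Removing outliers before quantization keeps the per-token scale small, which is why a lower bit-width can suffice for the remaining values.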
