arXiv:2505.10887v3 Announce Type: replace
Abstract: This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks step by step. We demonstrate this generality by evaluating the agent not only on pure vision-based real-world benchmarks (i.e., OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
Differential acceptance of a national digital health platform among community and frontline health workers in Cote d’Ivoire: a cross-sectional study
Introduction: Mobile-based digital health solutions are critical technologies that play a significant role in improving the quality of healthcare services. Cote d’Ivoire is digitizing its community-based