Abstract:
Large language models (LLMs) have matured and are demonstrating application potential across a wide range of scenarios. With the development of the AI agent ecosystem, LLM inference workloads have surpassed training workloads and become the core engine driving growth in demand for computing power. However, this shift in focus from model "training" to "inference" has placed significant pressure on inference infrastructure. This article analyzes the evolutionary trends of LLM inference infrastructure, summarizes the technical challenges it faces across the four dimensions of "computing, transmission, storage, and scheduling," and, drawing on Lenovo's innovative practices, proposes targeted hardware-software collaborative solutions.