Futures

Local ML and the Future of Inference (2023-12-30)

Summary

In this prediction for 2024, the author foresees a significant rise in local machine learning (ML). The trend will be driven by the adoption of Apple Silicon and other innovative hardware, as well as the growing capabilities of plain CPUs and mobile devices. Local inference will become a viable alternative to hosted inference, particularly for smaller models. For giant models (large LLMs), however, the limits of Apple Silicon in compute and memory make it unlikely to match the performance of servers with multiple GPUs, and integrating the I/O system and data would present further challenges. The author also emphasizes the importance of Hugging Face as a "GitHub for models." Additionally, the author predicts the emergence of smaller, task-specialized LLMs at the edge, which will prove practical for numerous use cases, including personal local agents, while still benefiting from larger LLMs in the cloud for continuous tuning and distillation. The author speculates on future possibilities such as OS-level model access and on-demand model routing, and on the development of alternative hardware and GPU solutions for LLM democratization, including potential CUDA alternatives from companies like AMD, Intel, or Big Tech.
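
To make the local-inference point concrete, below is a minimal sketch of running a smaller model entirely on-device with Apple Silicon acceleration. It is illustrative only: it assumes the transformers and torch packages are installed, and the model name "Qwen/Qwen2.5-0.5B-Instruct" is just an example of a small model pulled from Hugging Face, not something named in the original article.

```python
# Minimal sketch: local text generation on Apple Silicon with a small model.
# Assumes `transformers` and `torch` are installed and that the (illustrative)
# model "Qwen/Qwen2.5-0.5B-Instruct" fits comfortably in local memory.
import torch
from transformers import pipeline

# Use the Apple Silicon GPU (MPS backend) if present, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative small model from Hugging Face
    device=device,
)

result = generator("Local inference means", max_new_tokens=40)
print(result[0]["generated_text"])
```

The same pattern applies to any sufficiently small model; the practical limit is the memory and compute budget of the local machine rather than the API.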

Signals

Signal: Growing adoption of Apple Silicon and other innovative hardware
Change: From cloud-based inference to local machine learning
10-year horizon: Local inference becomes a viable alternative to hosted inference
Driving force: Adoption of Apple Silicon and other innovative hardware

Signal: Development of specialized and smaller LLMs for edge computing
Change: From large LLMs to task-specific LLMs at the edge
10-year horizon: Specialized LLMs become the norm for local and task-specific use cases
Driving force: Practicality and efficiency for a large number of use cases

Signal: Potential for the OS to provide centralized local model access
Change: Centralized access to local models through the OS
10-year horizon: An OS-level service for accessing local models
Driving force: Improving user experience and reducing model duplication on devices

Signal: Advancements in hardware capabilities from companies like AMD, Intel, and Qualcomm
Change: From current hardware capabilities to more powerful alternatives
10-year horizon: Continued advancements in hardware for local AI
Driving force: Competition among companies and demand for more powerful hardware

Signal: Exploration of different ways to serve local AI, such as Apple Silicon, transformers.js, ONNX, and WASM models (see the sketch after this list)
Change: From traditional training pipelines to various approaches for running models locally
10-year horizon: Adoption of different technologies for local AI serving
Driving force: Experimentation and finding the most efficient approach for local AI serving
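
As a concrete illustration of the last signal, here is a minimal sketch of local serving with ONNX Runtime on CPU. The model path "model.onnx" and the input shape are assumptions made for illustration; the original article only names ONNX as one of several local-serving options alongside Apple Silicon, transformers.js, and WASM.

```python
# Minimal sketch: local inference with ONNX Runtime on CPU.
# Assumes `onnxruntime` and `numpy` are installed and that "model.onnx"
# (an illustrative path) is an exported model with a single float32 input
# of shape (1, 3, 224, 224).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example input tensor
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)  # shape of the first output tensor
```

The same exported model can also be served in the browser via WASM-backed runtimes such as transformers.js, which is part of what makes these formats attractive for local AI.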
