
Lightweight CPU Inference Engine for Large Language Models

Run powerful LLMs on any CPU with zero dependencies. A single-file C99 implementation that brings AI capabilities to standard hardware.

LM.C is a research project by NileAGI

Why lm.c?

Built for accessibility, efficiency, and maximum portability

Zero Dependencies

Single-file C99 implementation runs anywhere without external libraries

30+ Quantization Formats

Supports more than 30 GGML quantization formats, from F32 down to IQ1_M, for maximum efficiency

CPU Optimized

Designed specifically for CPU inference with minimal memory footprint

Portable & Lightweight

Works on any system with a C compiler - no GPU required

System Architecture

A streamlined pipeline from model loading to text generation

GGUF File Loading
Header & Metadata Parsing
Tensor Info Loading
Quantization Handling
Transformer Execution
Token Generation
Text Output

Core Components

Robust, optimized components working together seamlessly

GGUF Parser

Handles all GGUF metadata types and quantization formats with zero dependencies
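
For reference, the GGUF specification defines the following metadata value types; the constant names below follow the upstream spec and may not match the identifiers used inside lm.c:

enum gguf_metadata_value_type {
    GGUF_METADATA_VALUE_TYPE_UINT8   = 0,
    GGUF_METADATA_VALUE_TYPE_INT8    = 1,
    GGUF_METADATA_VALUE_TYPE_UINT16  = 2,
    GGUF_METADATA_VALUE_TYPE_INT16   = 3,
    GGUF_METADATA_VALUE_TYPE_UINT32  = 4,
    GGUF_METADATA_VALUE_TYPE_INT32   = 5,
    GGUF_METADATA_VALUE_TYPE_FLOAT32 = 6,
    GGUF_METADATA_VALUE_TYPE_BOOL    = 7,
    GGUF_METADATA_VALUE_TYPE_STRING  = 8,   // uint64 length followed by UTF-8 bytes
    GGUF_METADATA_VALUE_TYPE_ARRAY   = 9,   // element type, count, then packed elements
    GGUF_METADATA_VALUE_TYPE_UINT64  = 10,
    GGUF_METADATA_VALUE_TYPE_INT64   = 11,
    GGUF_METADATA_VALUE_TYPE_FLOAT64 = 12,
};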

Quantization Engine

Supports 30+ GGML quantization formats from F32 to IQ1_M
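
To make the simplest of these concrete: a Q8_0 tensor is stored as blocks of 32 signed 8-bit values that share one half-precision scale. The sketch below follows the GGML block layout; the helper names are illustrative, and lm.c's actual kernels may be organized differently.

#include <stdint.h>
#include <string.h>

#define QK8_0 32

// GGML Q8_0 block: one fp16 scale shared by 32 signed 8-bit values.
typedef struct {
    uint16_t d;           // scale, stored as IEEE 754 half precision
    int8_t   qs[QK8_0];   // quantized values
} block_q8_0;

// Minimal fp16 -> fp32 conversion (handles normals, subnormals, inf/NaN).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                                  // +/- zero
        } else {                                          // subnormal: renormalize
            uint32_t e = 113;                             // 127 - 15 + 1
            while (!(mant & 0x400u)) { mant <<= 1; e--; }
            bits = sign | (e << 23) | ((mant & 0x3FFu) << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (mant << 13);         // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13); // normal value
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Expand n_blocks Q8_0 blocks into n_blocks * QK8_0 floats.
static void dequantize_q8_0(const block_q8_0 *blocks, float *out, size_t n_blocks) {
    for (size_t b = 0; b < n_blocks; b++) {
        float d = fp16_to_fp32(blocks[b].d);
        for (int i = 0; i < QK8_0; i++) {
            out[b * QK8_0 + i] = d * (float)blocks[b].qs[i];
        }
    }
}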

CPU Inference

Optimized transformer execution with minimal memory footprint

Portable Runtime

Single-file C99 implementation runs anywhere

How It Works

From input text to generated output - a streamlined inference workflow

Input Text
Tokenization
Embedding Lookup
Transformer Layers
Layer Norm
Attention
FFN
Residual Add
Final Norm
Output Projection
Sampling
Generated Text
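
Two of the numeric building blocks in that pipeline are small enough to show in full. The sketch below implements RMS normalization (the layer-norm variant used by Llama-family models) and the numerically stable softmax applied to attention scores and output logits; the surrounding attention and FFN code is omitted, and lm.c's implementation may differ in detail.

#include <math.h>
#include <stddef.h>

// RMS normalization: y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps).
// Llama-family models use this in place of classic LayerNorm.
static void rmsnorm(float *out, const float *x, const float *w,
                    size_t n, float eps) {
    float ss = 0.0f;
    for (size_t i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (size_t i = 0; i < n; i++) out[i] = w[i] * (x[i] * scale);
}

// Numerically stable in-place softmax, used for attention weights
// and for turning output logits into sampling probabilities.
static void softmax(float *x, size_t n) {
    float max = x[0];
    for (size_t i = 1; i < n; i++) if (x[i] > max) max = x[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) { x[i] = expf(x[i] - max); sum += x[i]; }
    for (size_t i = 0; i < n; i++) x[i] /= sum;
}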

GGUF File Structure

Efficient storage and loading format for large language models

struct gguf_header_t {
    uint32_t magic;               // Magic bytes "GGUF"
    uint32_t version;             // Format version
    uint64_t tensor_count;        // Number of tensors
    uint64_t metadata_kv_count;   // Number of metadata key/value pairs
    gguf_metadata_kv_t metadata_kv[];   // Variable-length metadata entries
};
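
Because the metadata entries that follow the fixed fields are variable-length, the header cannot be read in a single pass over this struct. A minimal sketch of reading and validating just the fixed-size portion (little-endian host assumed, error handling trimmed):

#include <stdint.h>
#include <stdio.h>

#define GGUF_MAGIC 0x46554747u   // the bytes 'G','G','U','F' read as a little-endian uint32

// Read and validate the fixed-size portion of a GGUF header.
// Returns 0 on success, -1 on failure. The metadata key/value pairs and
// tensor infos that follow must be parsed field by field.
static int read_gguf_header(FILE *f, uint32_t *version,
                            uint64_t *tensor_count, uint64_t *metadata_kv_count) {
    uint32_t magic = 0;
    if (fread(&magic, sizeof magic, 1, f) != 1 || magic != GGUF_MAGIC) return -1;
    if (fread(version, sizeof *version, 1, f) != 1) return -1;
    if (fread(tensor_count, sizeof *tensor_count, 1, f) != 1) return -1;
    if (fread(metadata_kv_count, sizeof *metadata_kv_count, 1, f) != 1) return -1;
    return 0;
}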

Memory-Efficient Design

Optimized techniques for minimal memory footprint

GGUF Parser
Quantization
Tensor Mapping
Activation Buffers
KV Cache
Token Buffers
SIMD Registers
Thread Pools
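
Of these, the KV cache is usually the dominant runtime allocation, and its size follows directly from the model shape. The formula below is standard; the parameter names are illustrative rather than lm.c identifiers.

#include <stddef.h>

// Bytes needed for the key/value cache:
//   2 (K and V) * layers * context length * KV heads * head dim * element size.
// For example, a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dim 128)
// at a 4096-token context in fp16 needs 2*32*4096*32*128*2 bytes = 2 GiB.
static size_t kv_cache_bytes(size_t n_layers, size_t n_ctx, size_t n_kv_heads,
                             size_t head_dim, size_t bytes_per_elem) {
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}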

Development Roadmap

Ongoing development and planned features

GGUF File Loader: Complete with metadata extraction

Tensor Data Mapping: Memory-mapped tensor access

Quantization Kernels: All 30+ GGML formats

Transformer Layers: CPU-optimized implementation

Tokenization: Byte-pair encoding support

Sampling: Temperature-based token selection (see the sketch after this list)

SIMD Optimization: AVX2/NEON acceleration

Thread Parallelism: Multi-core support

Interactive Mode: Chat interface
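
To make the sampling item concrete: plain temperature sampling scales the logits by 1/temperature, turns them into probabilities with a softmax, and draws a token from that distribution. The sketch below shows that baseline; lm.c's sampler may add top-k or top-p filtering on top of it.

#include <math.h>
#include <stddef.h>

// Pick a token id from raw logits using temperature sampling.
// Lower temperature approaches greedy argmax; higher values flatten the
// distribution. `rng01` must be a uniform random value in [0, 1).
static size_t sample_temperature(float *logits, size_t n_vocab,
                                 float temperature, float rng01) {
    // Scale logits by 1/temperature.
    for (size_t i = 0; i < n_vocab; i++) logits[i] /= temperature;

    // Numerically stable softmax, in place.
    float max = logits[0];
    for (size_t i = 1; i < n_vocab; i++) if (logits[i] > max) max = logits[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n_vocab; i++) { logits[i] = expf(logits[i] - max); sum += logits[i]; }

    // Inverse-CDF sampling over the normalized probabilities.
    float cdf = 0.0f;
    for (size_t i = 0; i < n_vocab; i++) {
        cdf += logits[i] / sum;
        if (rng01 < cdf) return i;
    }
    return n_vocab - 1;   // guard against floating-point round-off
}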

Performance Optimizations

CPU-specific enhancements for maximum efficiency

Quantization Aware Ops

Process quantized weights directly without full dequantization
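
A common way to do this is to keep weights in their quantized blocks and fold each block's scale in once, instead of expanding the whole tensor to fp32 first. The sketch below applies that idea to Q8_0 weights against fp32 activations, reusing the block_q8_0 layout and fp16_to_fp32 helper from the quantization sketch above; production kernels typically also quantize the activations and use integer SIMD instructions.

// Dot product of a row stored as Q8_0 blocks with an fp32 activation vector.
// Uses block_q8_0, QK8_0, and fp16_to_fp32 from the dequantization sketch above.
// The weights are never expanded into a temporary fp32 buffer: each block's
// contribution is accumulated first, then scaled once by its shared scale.
static float dot_q8_0_f32(const block_q8_0 *w, const float *x, size_t n_blocks) {
    float acc = 0.0f;
    for (size_t b = 0; b < n_blocks; b++) {
        float block_acc = 0.0f;
        for (int i = 0; i < QK8_0; i++) {
            block_acc += (float)w[b].qs[i] * x[b * QK8_0 + i];
        }
        acc += fp16_to_fp32(w[b].d) * block_acc;
    }
    return acc;
}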

Block Processing

Optimized cache utilization for better memory access patterns
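
As a generic illustration of the idea (not lm.c's exact loop structure), a matrix-vector product can be computed in column tiles so that each slice of the input vector stays in cache while every row consumes it:

#include <stddef.h>

#define TILE 64   // tile width, chosen so a slice of x stays resident in L1/L2

// y = W * x computed in column tiles: each TILE-wide slice of x is
// reused across all rows before the next slice is touched.
static void matvec_blocked(float *y, const float *W, const float *x,
                           size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) y[r] = 0.0f;
    for (size_t c0 = 0; c0 < cols; c0 += TILE) {
        size_t c1 = c0 + TILE < cols ? c0 + TILE : cols;
        for (size_t r = 0; r < rows; r++) {
            float acc = 0.0f;
            for (size_t c = c0; c < c1; c++) acc += W[r * cols + c] * x[c];
            y[r] += acc;
        }
    }
}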

Memory Mapping

Zero-copy weight access for reduced memory overhead
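
On POSIX systems this can be done with mmap, so tensor data is paged in from the GGUF file on demand instead of being copied into heap buffers; a Windows build would use the file-mapping APIs instead. A minimal sketch:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an entire GGUF file read-only. Tensor data can then be addressed
// directly at (base + data_offset) with no copies into malloc'd buffers.
static void *map_model_file(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return NULL;

    *size_out = (size_t)st.st_size;
    return base;
}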

Thread Parallelism

Layer-wise execution across multiple CPU cores
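
Inside each layer, the large matrix-vector products are the natural place to spread work across cores, with each thread handling a contiguous range of output rows. The sketch below uses POSIX threads and plain fp32 weights for clarity; it illustrates the general pattern rather than lm.c's exact scheduler, and error handling is omitted.

#include <pthread.h>
#include <stddef.h>

typedef struct {
    const float *W;      // rows x cols weight matrix
    const float *x;      // input vector (cols)
    float       *y;      // output vector (rows)
    size_t cols, row_begin, row_end;
} matvec_job;

// Each worker computes its assigned range of output rows.
static void *matvec_worker(void *arg) {
    matvec_job *job = (matvec_job *)arg;
    for (size_t r = job->row_begin; r < job->row_end; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < job->cols; c++) acc += job->W[r * job->cols + c] * job->x[c];
        job->y[r] = acc;
    }
    return NULL;
}

// y = W * x with the rows split evenly across n_threads worker threads.
static void matvec_parallel(float *y, const float *W, const float *x,
                            size_t rows, size_t cols, size_t n_threads) {
    pthread_t  tid[16];
    matvec_job job[16];
    if (n_threads > 16) n_threads = 16;

    size_t chunk = (rows + n_threads - 1) / n_threads;
    for (size_t t = 0; t < n_threads; t++) {
        size_t begin = t * chunk;
        size_t end   = begin + chunk < rows ? begin + chunk : rows;
        job[t] = (matvec_job){ W, x, y, cols, begin, end };
        pthread_create(&tid[t], NULL, matvec_worker, &job[t]);
    }
    for (size_t t = 0; t < n_threads; t++) pthread_join(tid[t], NULL);
}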

Ready to explore lm.c?

Dive into the code, contribute to the project, or learn more about how lm.c is pushing the boundaries of accessible AI.

A research project by NileAGI