What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Ollama

Ollama is an open-source runtime that enables developers and researchers to download, run, and manage large language models locally on consumer hardware without cloud API dependencies.

6 min readLast updated June 2026Infrastructure

Ollama is an open-source tool that allows users to run large language models locally on personal computers and servers. Often described as "Docker for AI models", Ollama abstracts the complexity of model management, hardware acceleration, and memory optimisation behind a simple command-line interface and HTTP API. Since its release in 2023, it has become one of the most widely adopted local inference runtimes in the developer community, accumulating hundreds of millions of model downloads.

Background and Motivation

The rise of open-weight models such as Meta's Llama series, Mistral, and Qwen created demand for tooling that could make these models accessible without requiring cloud API subscriptions or deep knowledge of GPU programming. Prior to Ollama, running a local model typically required manually installing llama.cpp or PyTorch, configuring CUDA drivers, and handling model quantisation. Ollama packaged these steps into a single binary that detects available hardware and selects appropriate inference parameters automatically.

The project is maintained by Ollama, Inc. and distributed under the MIT licence, with the source code hosted on GitHub. By mid-2026, the Ollama model library contained over 200 model families including Llama 3.3, DeepSeek, Gemma, Phi, Qwen, and Mistral, each available in multiple quantisation levels.

Architecture

Ollama operates as a daemon process that listens on localhost port 11434. When a user pulls a model, Ollama downloads the model weights in GGUF format (a binary format designed for efficient CPU and GPU inference), stores them in a local registry, and loads them into memory on demand. The inference backend is llama.cpp, a highly optimised C++ library that supports CPU inference on x86 and ARM architectures, as well as GPU acceleration via CUDA on NVIDIA hardware and Metal on Apple Silicon.

The tool exposes an OpenAI-compatible REST API, meaning that applications built against the OpenAI SDK can typically switch to a locally-running Ollama instance by changing a single endpoint URL. This compatibility has been a significant factor in adoption, as it lowers the cost of integrating local models into existing workflows.

Key Components

Ollama consists of several layers working together. The model registry handles download, versioning, and deduplication of model layers, analogous to a container image registry. The runtime engine manages context windows, batching, and memory allocation, including optional GPU offloading of individual model layers when VRAM is limited. The REST API layer translates HTTP requests into inference calls and streams token output using server-sent events. An optional CLI wraps common operations such as model listing, removal, and interactive chat sessions.

Quantisation and Performance

Ollama primarily serves GGUF-format models, which apply quantisation techniques to reduce the precision of model weights from 32-bit or 16-bit floating point to lower bit representations such as 4-bit or 8-bit integers. Quantisation reduces memory requirements significantly: a 7-billion-parameter model at 4-bit precision requires roughly 4 GB of RAM, making it runnable on laptops with 8 GB of total system memory.

On Apple Silicon hardware with unified memory, Ollama delivers particularly strong performance because the GPU can access system RAM directly without separate VRAM. A MacBook Pro with an M3 chip can generate tokens at 60-100 tokens per second for an 8B-parameter model. On NVIDIA consumer GPUs, performance scales with VRAM capacity; an RTX 4090 with 24 GB VRAM can run 70B parameter models at 4-bit quantisation at approximately 30-50 tokens per second.

Comparison with Alternatives

| Tool | Primary Use Case | Backend | API Compatibility | |---|---|---|---| | Ollama | Local developer inference | llama.cpp | OpenAI-compatible | | vLLM | Production serving, high throughput | PyTorch | OpenAI-compatible | | LM Studio | GUI-based local inference | llama.cpp | OpenAI-compatible | | llama.cpp | Bare-metal CLI inference | C++ native | Custom | | Text Generation Inference (TGI) | Production Hugging Face serving | PyTorch | Custom |

Ollama is generally recommended for individual developers and small teams requiring local inference with minimal setup. For production deployments requiring high concurrency and throughput, vLLM is typically preferred.

Ecosystem Integration

Ollama integrates with major AI application frameworks including LangChain, LlamaIndex, and LangGraph, which provide direct client connectors for the Ollama API. Many popular open-source tools such as OpenWebUI, AnythingLLM, and Continue (a VS Code extension) use Ollama as a backend. The Modelfile format allows users to customise system prompts, temperature settings, and context lengths for specific model configurations, enabling repeatable deployment of customised model personalities.

Privacy and Compliance Implications

A central use case for Ollama is processing sensitive or proprietary data that organisations cannot send to external cloud APIs. Lawyers, healthcare professionals, and enterprise software developers use local inference to analyse documents without data leaving their networks. This is particularly relevant in jurisdictions with strict data localisation requirements.

Malaysian Context — Local AI for Data Sovereignty

Malaysia's Personal Data Protection Act (PDPA) 2010, and its amendments under the PDPA Amendment Act 2023, impose obligations on organisations handling personal data to ensure it is not transferred outside Malaysia without adequate protection. This regulatory context has increased interest in on-premises AI inference tools such as Ollama among Malaysian enterprises, particularly in sectors such as legal services, healthcare, and financial services that routinely process sensitive personal information.

Malaysian banks regulated by Bank Negara Malaysia (BNM) and securities firms under the Securities Commission Malaysia are subject to additional IT risk management guidelines that require data residency controls. For these institutions, local model inference using Ollama provides a compliant path to integrating AI assistance into internal workflows without routing customer data through overseas cloud providers.

The Malaysian government's MyDigital Blueprint and the Malaysia Digital Economy Corporation (MDEC) have both emphasised building local AI capability. Ollama and similar tools are used in developer bootcamps and hackathons organised by MDEC and HRD Corp-accredited training providers as an accessible entry point for Malaysian developers learning to work with open-weight models.

Several Malaysian technology companies, including those in the MSC Malaysia technology park ecosystem in Cyberjaya, have adopted Ollama for prototyping AI-assisted document processing and customer service automation. The tool's zero-cost licensing makes it attractive for Malaysian startups with limited budgets for AI experimentation.

Regional cloud providers operating in Malaysia, such as AWS and Microsoft Azure, offer GPU-optimised virtual machines that can run Ollama for teams that want local-style control in a cloud environment with Malaysian data residency guarantees, combining the privacy benefits of local inference with the scalability of cloud infrastructure.

References

Ollama, Inc. (2024). Ollama Documentation. https://ollama.com/
Gerganov, G. (2023). llama.cpp: Port of Facebook's LLaMA model in C/C++. GitHub. https://github.com/ggerganov/llama.cpp
Personal Data Protection Department Malaysia. (2023). Personal Data Protection Act 2010 (Amendment 2023). Ministry of Digital, Malaysia.
MDEC. (2024). Malaysia Digital Economy Blueprint. Malaysia Digital Economy Corporation.
Aijtkumar. (2025). The Complete Guide to Ollama: Run Large Language Models Locally. DEV Community.

Tags:local-llm inference open-source developer-tools

Type	Local LLM runtime
Developer	Ollama, Inc.
Initial release	2023
License	MIT
API compatibility	OpenAI-compatible REST API
Related	vLLM, LM Studio, llama.cpp

Background and Motivation

Architecture

Key Components

Quantisation and Performance

Comparison with Alternatives

Ecosystem Integration

Privacy and Compliance Implications

See Also

References