NVIDIA's primary statement is unusually explicit: yes, you can download and run NVIDIA Nemotron models from Hugging Face for free in production. The licensing architecture, the production family, and the on-prem fine-tuning workflow are the factual basis on which a municipality can credibly call its deployment sovereign.
The licensing architecture matters because municipalities are not buyers of opaque APIs. They are stewards of public records, criminal-justice data, and constituent identity. The NVIDIA Open Model License Agreement (June 2024) and the Nemotron Open Model License (December 2025) are permissive: commercial use, redistribution, modification, and no attribution required on outputs. The Llama-Nemotron line inherits the Meta Llama Community License, with the OMLA layered on the NVIDIA modifications. The legal basis for sovereignty is in the license, not in the marketing.
The sentence the city procurement officer needs to hear
A municipal customer can download the weights, store them on its own GPUs, fine-tune with its own data, redistribute derivatives within its own bureaucracy, and never make an external API call. That is the factual basis on which "sovereign" is more than slogan. It is also the basis on which the City Attorney signs.
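The no-external-call posture can be enforced mechanically. The sketch below uses the real Hugging Face offline flags (`HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE`, honored by `huggingface_hub` and `transformers`), which force every model load to resolve against the local cache; the weight path and the check itself are hypothetical illustrations, not a prescribed deployment.

```python
import os

# Air-gapped inference posture: both flags are honored by the Hugging Face
# libraries and force every load to resolve against local storage -- any
# attempt to reach the Hub fails fast instead of silently phoning home.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Hypothetical local weight store on city-owned GPUs; the path is illustrative.
LOCAL_WEIGHTS = "/srv/models/nemotron-mini-4b-instruct"

def assert_sovereign() -> bool:
    """Return True only when the process is pinned to locally stored weights."""
    return (
        os.environ.get("HF_HUB_OFFLINE") == "1"
        and os.environ.get("TRANSFORMERS_OFFLINE") == "1"
        and LOCAL_WEIGHTS.startswith("/srv/")
    )
```

A deployment script that runs this check at startup gives the City Attorney something auditable: if the flags are unset, inference does not start.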
The production family scales from 2 GB to eight H100s
Six models cover the workload spread we expect from city departments, SJSU, public safety, and Inception startups. The smallest fits on existing city hardware; the largest is rated by Artificial Analysis (April 2025) as the most intelligent open-source model in production.
Nemotron-Mini-4B-Instruct
~2 GB VRAM. Ideal for 311 chatbots and edge devices. The civic agent class.
Llama-3.1-Nemotron-Nano-8B
Single H100 fit. 128k context. The reasoning model for permit and code workloads.
Llama-3.3-Nemotron-Super-49B
Single H100 80GB fit. The reference for the civic-chatbot Guardrails pattern.
Llama-3.1-Nemotron-Ultra-253B
Eight-H100 node. Rated by Artificial Analysis (April 2025) as the most intelligent open-source model.
Nemotron 3 Nano (30B-A3B MoE)
1M token context. Four times the throughput of Nemotron 2 Nano. Document-scale RAG.
NemoGuard 8B
Content Safety, Topic Control, Jailbreak Detect. The civic guardrails substrate.
The sovereign fine-tuning workflow runs end-to-end on NVIDIA primitives
NeMo Customizer is an API-first Kubernetes microservice. It supports LoRA, full SFT, DPO, GRPO, and knowledge distillation, and is deployable on customer-owned A100 80GB or B200 hardware. NVIDIA explicitly markets it as the on-prem path for sensitive data: everything stays on-premises. NeMo Curator is Apache 2.0, GPU-accelerated via Ray and RAPIDS, and ships 30+ heuristic filters, fastText classification, exact, fuzzy, and semantic deduplication, and PII redaction via Presidio integration. NeMo Retriever provides extraction, embedding, and reranking NIMs as the RAG plumbing.
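The curation stages named above, heuristic filtering, exact deduplication, and PII redaction, can be sketched with standard-library stand-ins. This is a shape sketch only: the real NeMo Curator runs GPU-accelerated on Ray and RAPIDS and delegates PII handling to Presidio, and every function below is a toy illustration of the corresponding stage.

```python
import hashlib
import re

def heuristic_filter(doc: str, min_words: int = 5) -> bool:
    """Toy quality heuristic: drop near-empty records."""
    return len(doc.split()) >= min_words

def exact_dedup(docs: list[str]) -> list[str]:
    """Exact deduplication by content hash, keeping the first occurrence."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

# Toy SSN-like pattern; the real pipeline uses Presidio's analyzers instead.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(doc: str) -> str:
    """Replace matched identifiers in place, mirroring redaction-in-place."""
    return SSN.sub("[REDACTED]", doc)

def curate(docs: list[str]) -> list[str]:
    """Filter, deduplicate, then redact -- the Curator stage ordering."""
    kept = [d for d in docs if heuristic_filter(d)]
    return [redact_pii(d) for d in exact_dedup(kept)]
```

The point of the sketch is the data-flow guarantee: records are filtered and scrubbed before they ever reach fine-tuning, on the same machines they arrived on.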
Composed in the San José deployment, the workflow looks like this: city data lands in Curator, is filtered and redacted in place, and feeds Customizer, which produces a fine-tuned Nemotron variant on the city's own GPUs; at runtime, queries are grounded through Retriever, guardrailed by NemoGuard, and served to the citizen-facing interface. No external API call at any stage. The supply chain of the inference is, end to end, under municipal control.
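The composition above can be written down as a pipeline of local stages. Every function here is a stub, the names mirror the NVIDIA components but none of this is their API; the point of the sketch is the shape: each stage hands data to the next inside the municipal boundary, and no stage crosses the network.

```python
def curate(records: list[str]) -> list[str]:
    """Curator stage (stub): filter and redact in place."""
    return [r for r in records if r.strip()]

def customize(base_model: str, corpus: list[str]) -> str:
    """Customizer stage (stub): produce a fine-tuned variant on local GPUs."""
    return f"{base_model}-sanjose-lora"  # hypothetical variant name

def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Retriever stage (stub): naive keyword match stands in for embeddings."""
    words = query.lower().split()
    return [r for r in corpus if any(w in r.lower() for w in words)]

def guardrail(answer: str) -> str:
    """NemoGuard stage (stub): content-safety pass-through."""
    return answer

def serve(query: str, records: list[str],
          base_model: str = "nemotron-super-49b") -> str:
    """End-to-end: curate -> customize -> retrieve -> guardrail -> serve."""
    corpus = curate(records)
    model = customize(base_model, corpus)
    context = retrieve(query, corpus)
    return guardrail(f"[{model}] {query} -> {len(context)} records retrieved")
```

Swapping any stub for its real counterpart changes the implementation, not the topology: the citizen-facing answer is still produced entirely on municipal hardware.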
The civic Guardrails pattern is the publishable artifact
The Llama-3.3-Nemotron-Super-49B + NemoGuard 8B + NeMo Retriever combination, configured for the 311 chatbot, the permit AI, and the code-enforcement workload, becomes the reference configuration for any other US city building the same thing. Asking NVIDIA to publish this configuration as an official NVIDIA AI Blueprint, replacing the current ABC-bot reference, is open question 6 in the brief. It is a clean ask with clean upside for both sides.