Pro LLM Workspace
Enterprise-grade skill for production-ready LLM development patterns. Includes structured workflows, validation checks, and reusable patterns for AI research.
Overview
A professional LLM development workspace is a structured environment for building, testing, evaluating, and deploying large language model applications. It encompasses local model serving, prompt engineering workflows, evaluation pipelines, experiment tracking, and integration with both local and cloud-hosted models. This template provides a comprehensive guide to setting up a workspace that supports the full LLM development lifecycle, from rapid prototyping with local models to production deployment with cloud APIs. The workspace integrates tools like Ollama and LM Studio for local inference, LangChain and LlamaIndex for application frameworks, and evaluation tooling for systematic quality assessment. Whether you are building RAG systems, fine-tuning models, or developing agent workflows, this workspace configuration ensures reproducibility, fast iteration, and smooth transitions from development to production.
When to Use
- Rapid LLM prototyping: Quickly test ideas with local models before committing to cloud API costs.
- Prompt engineering workflow: Systematically develop, version, and evaluate prompts across multiple models.
- RAG application development: Build and iterate on retrieval-augmented generation systems with local vector stores.
- Model comparison: Evaluate multiple models (local and cloud) on the same tasks to select the best fit.
- Fine-tuning workflows: Prepare datasets, run fine-tuning jobs, and evaluate fine-tuned models.
- Team LLM development: Establish consistent development practices across a team building LLM applications.
Quick Start
Core Tool Installation
```bash
# Local model serving
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Python environment
python -m venv llm-workspace && source llm-workspace/bin/activate
pip install langchain langchain-openai langchain-community
pip install chromadb sentence-transformers
pip install jupyter ipykernel

# Evaluation and experiment tracking
pip install ragas mlflow promptfoo
```
Workspace Directory Structure
```
llm-workspace/
├── prompts/        # Versioned prompt templates
│   ├── v1/
│   └── v2/
├── data/           # Training and evaluation data
│   ├── eval-sets/
│   └── fine-tune/
├── notebooks/      # Exploration and prototyping
├── src/            # Application source code
│   ├── chains/
│   ├── agents/
│   └── tools/
├── evals/          # Evaluation scripts and results
├── configs/        # Model and environment configs
│   ├── models.yaml
│   └── .env
├── scripts/        # Utility scripts
└── mlruns/         # MLflow experiment tracking
```
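The layout above can be scaffolded in one step; a minimal sketch (directory and file names mirror the tree exactly, adjust to taste):

```shell
# Create the workspace skeleton shown in the tree above
mkdir -p llm-workspace/prompts/v1 llm-workspace/prompts/v2
mkdir -p llm-workspace/data/eval-sets llm-workspace/data/fine-tune
mkdir -p llm-workspace/notebooks llm-workspace/evals llm-workspace/scripts llm-workspace/mlruns
mkdir -p llm-workspace/src/chains llm-workspace/src/agents llm-workspace/src/tools
mkdir -p llm-workspace/configs
touch llm-workspace/configs/models.yaml llm-workspace/configs/.env
```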
Verify Setup
```python
# Test local model
import ollama

response = ollama.chat(model="llama3.1:8b", messages=[
    {"role": "user", "content": "Hello, confirm you are running locally."}
])
print(response["message"]["content"])

# Test cloud model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke("Confirm cloud connectivity.").content)
```
Core Concepts
Model Configuration Layer
```yaml
# configs/models.yaml
models:
  local:
    fast:
      provider: ollama
      model: llama3.1:8b
      base_url: http://localhost:11434
      use_for: [prototyping, testing, simple-tasks]
    embed:
      provider: ollama
      model: nomic-embed-text
      base_url: http://localhost:11434
  cloud:
    primary:
      provider: openai
      model: gpt-4o
      use_for: [complex-reasoning, production]
    fast:
      provider: openai
      model: gpt-4o-mini
      use_for: [classification, simple-generation]
    embed:
      provider: openai
      model: text-embedding-3-small

routing:
  development: local.fast
  evaluation: cloud.primary
  production: cloud.primary
  embedding: local.embed
```
Model Router Implementation
```python
import yaml
from langchain_openai import ChatOpenAI
from langchain_community.llms import Ollama


class ModelRouter:
    def __init__(self, config_path="configs/models.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def get_model(self, purpose: str = "development"):
        route = self.config["routing"].get(purpose)
        if route is None:
            # Fall back to the development route, which is a
            # "category.name" string and therefore safe to split below
            route = self.config["routing"]["development"]
        category, name = route.split(".")
        model_config = self.config["models"][category][name]
        if model_config["provider"] == "ollama":
            return Ollama(
                model=model_config["model"],
                base_url=model_config["base_url"]
            )
        elif model_config["provider"] == "openai":
            return ChatOpenAI(model=model_config["model"])


router = ModelRouter()
dev_model = router.get_model("development")   # Local Ollama
prod_model = router.get_model("production")   # Cloud GPT-4o
```
Prompt Version Management
```python
import json
from pathlib import Path
from datetime import datetime


class PromptManager:
    def __init__(self, prompts_dir="prompts"):
        self.dir = Path(prompts_dir)
        self.dir.mkdir(exist_ok=True)

    def save(self, name: str, template: str, metadata: dict = None):
        version = datetime.now().strftime("%Y%m%d_%H%M%S")
        prompt_dir = self.dir / name / version
        prompt_dir.mkdir(parents=True, exist_ok=True)
        (prompt_dir / "template.txt").write_text(template)
        (prompt_dir / "metadata.json").write_text(
            json.dumps({
                "version": version,
                "created": datetime.now().isoformat(),
                **(metadata or {})
            }, indent=2)
        )
        return version

    def load(self, name: str, version: str = "latest") -> str:
        prompt_dir = self.dir / name
        if version == "latest":
            # Timestamped directory names sort lexicographically,
            # so the last entry is the newest version
            versions = sorted(prompt_dir.iterdir())
            prompt_dir = versions[-1]
        else:
            prompt_dir = prompt_dir / version
        return (prompt_dir / "template.txt").read_text()


pm = PromptManager()
pm.save("summarizer", "Summarize the following:\n{text}\n\nKey points:")
template = pm.load("summarizer", "latest")
```
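Since the stored templates use Python `str.format` placeholders (the summarizer template above uses `{text}`), rendering a loaded prompt is a one-liner; a minimal sketch with the template inlined for illustration:

```python
# Render a versioned template; the {text} placeholder matches the
# summarizer template saved via PromptManager above
template = "Summarize the following:\n{text}\n\nKey points:"  # e.g. pm.load("summarizer")
prompt = template.format(text="LLM workspaces standardize prompt versioning.")
print(prompt)
```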
Evaluation Pipeline
```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    input: str
    expected: str
    tags: list[str] = None


def run_evaluation(model, eval_set: list[EvalCase], judge_model=None):
    results = []
    for case in eval_set:
        output = model.invoke(case.input)
        score = judge_model.invoke(
            f"Rate 1-5 how well this output matches the expected.\n"
            f"Output: {output}\nExpected: {case.expected}\nScore:"
        ) if judge_model else None
        results.append({
            "input": case.input,
            "output": str(output),
            "expected": case.expected,
            "score": score,
            "tags": case.tags
        })
    return results


# Usage
eval_set = [
    EvalCase("Summarize: AI is transforming...", "AI transforms industries..."),
    EvalCase("Summarize: Climate change...", "Climate change impacts..."),
]
results = run_evaluation(
    model=router.get_model("development"),
    eval_set=eval_set,
    judge_model=router.get_model("evaluation")
)
```
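The raw results list rolls up into summary metrics in a few lines; a minimal sketch (it assumes judge scores have already been parsed into integers 1-5, which may require post-processing the judge model's text output):

```python
def summarize_results(results, pass_threshold=4):
    """Aggregate judge scores into a mean score and a pass rate.

    Assumes each result's "score" is an int 1-5, or None when no
    judge model was used; non-integer scores are skipped.
    """
    scores = [r["score"] for r in results if isinstance(r["score"], int)]
    if not scores:
        return {"mean_score": None, "pass_rate": None, "n": 0}
    return {
        "mean_score": sum(scores) / len(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "n": len(scores),
    }
```

A pass rate against a fixed threshold is what a CI quality gate would compare against, rather than the raw per-case scores.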
Configuration Reference
| Tool | Purpose | Configuration |
|---|---|---|
| Ollama | Local model serving | OLLAMA_HOST=0.0.0.0:11434 |
| LM Studio | GUI model management | Default port 1234 |
| MLflow | Experiment tracking | MLFLOW_TRACKING_URI=./mlruns |
| ChromaDB | Local vector store | CHROMA_PERSIST_DIR=./chroma_db |
| Promptfoo | Prompt evaluation | promptfoo.yaml config file |
| LangSmith | Cloud tracing | LANGCHAIN_TRACING_V2=true |
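For the Promptfoo row above, a minimal `promptfoo.yaml` sketch comparing a local and a cloud model on one prompt (the provider ids and assertion type are illustrative; check them against your promptfoo version):

```yaml
# promptfoo.yaml -- compare a local and a cloud model on the same prompt
prompts:
  - "Summarize the following:\n{{text}}\n\nKey points:"
providers:
  - ollama:llama3.1:8b
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "Large language models are transforming software development."
    assert:
      - type: contains
        value: "language"
```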
Environment Variables
| Variable | Description | Example |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key | sk-... |
| ANTHROPIC_API_KEY | Anthropic API key | sk-ant-... |
| OLLAMA_HOST | Ollama server address | http://localhost:11434 |
| LANGCHAIN_TRACING_V2 | Enable LangSmith tracing | true |
| MLFLOW_TRACKING_URI | MLflow tracking server | ./mlruns |
| HF_TOKEN | Hugging Face access token | hf_... |
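A small startup check catches a missing key before the first API call fails; a sketch validating the variables in the table above (which variables are hard requirements depends on your setup — the split below is illustrative):

```python
import os

REQUIRED = ["OPENAI_API_KEY"]                       # illustrative hard requirement
OPTIONAL = ["ANTHROPIC_API_KEY", "OLLAMA_HOST",     # features degrade without these
            "LANGCHAIN_TRACING_V2", "MLFLOW_TRACKING_URI", "HF_TOKEN"]


def check_env(environ=os.environ):
    """Return (missing_required, missing_optional) variable names."""
    missing_required = [v for v in REQUIRED if not environ.get(v)]
    missing_optional = [v for v in OPTIONAL if not environ.get(v)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_env()
    for var in required:
        print(f"ERROR: {var} is not set")
    for var in optional:
        print(f"note: {var} is not set (some features disabled)")
```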
Best Practices
- Use local models for development, cloud for evaluation: Prototype rapidly with Ollama-served models to avoid API costs, then validate with cloud models before production deployment.
- Version everything: Track prompts, evaluation datasets, model configurations, and results with explicit versioning. Use Git for code and prompts, DVC for large datasets.
- Build evaluation sets early: Create evaluation datasets from day one. Even 20-30 curated examples provide valuable signal for detecting regressions as you iterate on prompts and models.
- Implement a model router: Abstract model selection behind a router that switches between local and cloud models based on the development stage. This makes it trivial to upgrade models or switch providers.
- Use structured output schemas: Define Pydantic models for LLM outputs from the start. This catches format errors early and makes downstream processing reliable.
- Track experiments with MLflow: Log every significant prompt change, model swap, or parameter tweak as an MLflow experiment. This creates an auditable history of what worked and what did not.
- Separate concerns in your codebase: Keep prompts, chains, tools, and evaluation logic in separate directories. This makes it easy to test and swap components independently.
- Test with multiple models: Regularly run your evaluation suite against different models to avoid overfitting your prompts to a single model's quirks.
- Automate quality gates: Set up CI scripts that run evaluation suites and block merges when quality metrics drop below thresholds.
- Document model-specific behaviors: Keep notes on model-specific prompt patterns, token limits, and known failure modes. This institutional knowledge accelerates onboarding and debugging.
Troubleshooting
Ollama model download fails or hangs
Check available disk space (models range from 4GB to 40GB+). Verify network connectivity with curl http://localhost:11434/api/tags. Restart the Ollama service with ollama serve and retry the pull.
Local model responses are slow
Ensure you are using a quantized model appropriate for your hardware. On CPU-only machines, use Q4_K_M quantizations. Check that no other processes are consuming GPU memory with nvidia-smi.
LangChain version conflicts
Pin your LangChain versions explicitly: langchain==0.3.x, langchain-openai==0.2.x. Use a virtual environment per project and avoid mixing LangChain v0.2 and v0.3 dependencies.
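A pinned requirements file following this advice might look like the sketch below (the wildcard pins mirror the version families named above; replace them with the exact versions you have tested together):

```
# requirements.txt -- pin the whole LangChain family to one release line
langchain==0.3.*
langchain-openai==0.2.*
langchain-community==0.3.*
```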
Evaluation results inconsistent across runs
Set temperature to 0 for evaluation runs. Use fixed random seeds where available. For LLM-as-a-judge evaluations, run multiple judge passes and average the scores.
Memory issues with large models locally
Monitor RAM with htop. For models larger than available RAM, use smaller quantizations or switch to API-based models. Close other memory-intensive applications during local inference.