Skill · Cliptics · ai research · v1.0.0 · MIT

Enterprise-grade skill with production-ready patterns. Includes structured workflows, validation checks, and reusable patterns for AI research.

Pro LLM Workspace

Overview

A professional LLM development workspace is a structured environment for building, testing, evaluating, and deploying large language model applications. It encompasses local model serving, prompt engineering workflows, evaluation pipelines, experiment tracking, and integration with both local and cloud-hosted models. This template provides a comprehensive guide to setting up a workspace that supports the full LLM development lifecycle, from rapid prototyping with local models to production deployment with cloud APIs. The workspace integrates tools like Ollama and LM Studio for local inference, LangChain and LlamaIndex for application frameworks, and evaluation tooling for systematic quality assessment. Whether you are building RAG systems, fine-tuning models, or developing agent workflows, this workspace configuration ensures reproducibility, fast iteration, and smooth transitions from development to production.

When to Use

  • Rapid LLM prototyping: Quickly test ideas with local models before committing to cloud API costs.
  • Prompt engineering workflow: Systematically develop, version, and evaluate prompts across multiple models.
  • RAG application development: Build and iterate on retrieval-augmented generation systems with local vector stores.
  • Model comparison: Evaluate multiple models (local and cloud) on the same tasks to select the best fit.
  • Fine-tuning workflows: Prepare datasets, run fine-tuning jobs, and evaluate fine-tuned models.
  • Team LLM development: Establish consistent development practices across a team building LLM applications.

Quick Start

Core Tool Installation

```bash
# Local model serving
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Python environment
python -m venv llm-workspace && source llm-workspace/bin/activate
pip install langchain langchain-openai langchain-community
pip install chromadb sentence-transformers
pip install jupyter ipykernel

# Evaluation and experiment tracking
pip install ragas mlflow promptfoo
```

Workspace Directory Structure

```
llm-workspace/
├── prompts/              # Versioned prompt templates
│   ├── v1/
│   └── v2/
├── data/                 # Training and evaluation data
│   ├── eval-sets/
│   └── fine-tune/
├── notebooks/            # Exploration and prototyping
├── src/                  # Application source code
│   ├── chains/
│   ├── agents/
│   └── tools/
├── evals/                # Evaluation scripts and results
├── configs/              # Model and environment configs
│   ├── models.yaml
│   └── .env
├── scripts/              # Utility scripts
└── mlruns/               # MLflow experiment tracking
```

Verify Setup

```python
# Test local model
import ollama

response = ollama.chat(model="llama3.1:8b", messages=[
    {"role": "user", "content": "Hello, confirm you are running locally."}
])
print(response["message"]["content"])

# Test cloud model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke("Confirm cloud connectivity.").content)
```

Core Concepts

Model Configuration Layer

```yaml
# configs/models.yaml
models:
  local:
    fast:
      provider: ollama
      model: llama3.1:8b
      base_url: http://localhost:11434
      use_for: [prototyping, testing, simple-tasks]
    embed:
      provider: ollama
      model: nomic-embed-text
      base_url: http://localhost:11434
  cloud:
    primary:
      provider: openai
      model: gpt-4o
      use_for: [complex-reasoning, production]
    fast:
      provider: openai
      model: gpt-4o-mini
      use_for: [classification, simple-generation]
    embed:
      provider: openai
      model: text-embedding-3-small

routing:
  development: local.fast
  evaluation: cloud.primary
  production: cloud.primary
  embedding: local.embed
```

Model Router Implementation

```python
import yaml
from langchain_openai import ChatOpenAI
from langchain_community.llms import Ollama


class ModelRouter:
    def __init__(self, config_path="configs/models.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def get_model(self, purpose: str = "development"):
        # Fall back to the development route when the purpose is unknown
        routing = self.config["routing"]
        route = routing.get(purpose, routing["development"])
        category, name = route.split(".")
        model_config = self.config["models"][category][name]
        if model_config["provider"] == "ollama":
            return Ollama(
                model=model_config["model"],
                base_url=model_config["base_url"],
            )
        if model_config["provider"] == "openai":
            return ChatOpenAI(model=model_config["model"])
        raise ValueError(f"Unknown provider: {model_config['provider']}")


router = ModelRouter()
dev_model = router.get_model("development")   # Local Ollama
prod_model = router.get_model("production")   # Cloud GPT-4o
```

Prompt Version Management

```python
import json
from pathlib import Path
from datetime import datetime


class PromptManager:
    def __init__(self, prompts_dir="prompts"):
        self.dir = Path(prompts_dir)
        self.dir.mkdir(exist_ok=True)

    def save(self, name: str, template: str, metadata: dict = None):
        version = datetime.now().strftime("%Y%m%d_%H%M%S")
        prompt_dir = self.dir / name / version
        prompt_dir.mkdir(parents=True, exist_ok=True)
        (prompt_dir / "template.txt").write_text(template)
        (prompt_dir / "metadata.json").write_text(
            json.dumps({
                "version": version,
                "created": datetime.now().isoformat(),
                **(metadata or {}),
            }, indent=2)
        )
        return version

    def load(self, name: str, version: str = "latest") -> str:
        prompt_dir = self.dir / name
        if version == "latest":
            # Timestamped directory names sort chronologically
            versions = sorted(prompt_dir.iterdir())
            prompt_dir = versions[-1]
        else:
            prompt_dir = prompt_dir / version
        return (prompt_dir / "template.txt").read_text()


pm = PromptManager()
pm.save("summarizer", "Summarize the following:\n{text}\n\nKey points:")
template = pm.load("summarizer", "latest")
```

Evaluation Pipeline

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    input: str
    expected: str
    tags: list[str] = None


def run_evaluation(model, eval_set: list[EvalCase], judge_model=None):
    results = []
    for case in eval_set:
        output = model.invoke(case.input)
        score = judge_model.invoke(
            f"Rate 1-5 how well this output matches the expected.\n"
            f"Output: {output}\nExpected: {case.expected}\nScore:"
        ) if judge_model else None
        results.append({
            "input": case.input,
            "output": str(output),
            "expected": case.expected,
            "score": score,
            "tags": case.tags,
        })
    return results


# Usage
eval_set = [
    EvalCase("Summarize: AI is transforming...", "AI transforms industries..."),
    EvalCase("Summarize: Climate change...", "Climate change impacts..."),
]
results = run_evaluation(
    model=router.get_model("development"),
    eval_set=eval_set,
    judge_model=router.get_model("evaluation"),
)
```

Configuration Reference

| Tool | Purpose | Configuration |
| --- | --- | --- |
| Ollama | Local model serving | `OLLAMA_HOST=0.0.0.0:11434` |
| LM Studio | GUI model management | Default port 1234 |
| MLflow | Experiment tracking | `MLFLOW_TRACKING_URI=./mlruns` |
| ChromaDB | Local vector store | `CHROMA_PERSIST_DIR=./chroma_db` |
| Promptfoo | Prompt evaluation | `promptfoo.yaml` config file |
| LangSmith | Cloud tracing | `LANGCHAIN_TRACING_V2=true` |
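The Promptfoo row above refers to a config file; a minimal sketch is shown below. The field names follow Promptfoo's documented schema, but the provider identifiers and test values here are illustrative, so verify against the version you install:

```yaml
# promptfoo.yaml — minimal evaluation config (illustrative)
prompts:
  - "Summarize the following:\n{{text}}\n\nKey points:"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "AI is transforming industries..."
    assert:
      - type: contains
        value: "AI"
```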

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| `OPENAI_API_KEY` | OpenAI API key | `sk-...` |
| `ANTHROPIC_API_KEY` | Anthropic API key | `sk-ant-...` |
| `OLLAMA_HOST` | Ollama server address | `http://localhost:11434` |
| `LANGCHAIN_TRACING_V2` | Enable LangSmith tracing | `true` |
| `MLFLOW_TRACKING_URI` | MLflow tracking server | `./mlruns` |
| `HF_TOKEN` | Hugging Face access token | `hf_...` |
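These variables can be collected in the `configs/.env` file from the directory structure above. All values here are placeholders:

```shell
# configs/.env — placeholder values; never commit real keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OLLAMA_HOST=http://localhost:11434
LANGCHAIN_TRACING_V2=true
MLFLOW_TRACKING_URI=./mlruns
HF_TOKEN=hf_...
```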

Best Practices

  1. Use local models for development, cloud for evaluation: Prototype rapidly with Ollama-served models to avoid API costs, then validate with cloud models before production deployment.

  2. Version everything: Track prompts, evaluation datasets, model configurations, and results with explicit versioning. Use Git for code and prompts, DVC for large datasets.

  3. Build evaluation sets early: Create evaluation datasets from day one. Even 20-30 curated examples provide valuable signal for detecting regressions as you iterate on prompts and models.

  4. Implement a model router: Abstract model selection behind a router that switches between local and cloud models based on the development stage. This makes it trivial to upgrade models or switch providers.

  5. Use structured output schemas: Define Pydantic models for LLM outputs from the start. This catches format errors early and makes downstream processing reliable.

  6. Track experiments with MLflow: Log every significant prompt change, model swap, or parameter tweak as an MLflow experiment. This creates an auditable history of what worked and what did not.

  7. Separate concerns in your codebase: Keep prompts, chains, tools, and evaluation logic in separate directories. This makes it easy to test and swap components independently.

  8. Test with multiple models: Regularly run your evaluation suite against different models to avoid overfitting your prompts to a single model's quirks.

  9. Automate quality gates: Set up CI scripts that run evaluation suites and block merges when quality metrics drop below thresholds.

  10. Document model-specific behaviors: Keep notes on model-specific prompt patterns, token limits, and known failure modes. This institutional knowledge accelerates onboarding and debugging.
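Best practice 5 can be sketched with a minimal Pydantic schema. The `Summary` model and the raw JSON string are illustrative; in a real chain the string would come back from the LLM (or you could pass the schema to LangChain's `with_structured_output` helper):

```python
# Hypothetical schema for a summarization task: the LLM is instructed to
# return JSON matching this model, and validation catches format drift early.
from pydantic import BaseModel, ValidationError


class Summary(BaseModel):
    title: str
    key_points: list[str]
    confidence: float


# Simulated raw model output (in practice, the LLM response text)
raw = '{"title": "AI in industry", "key_points": ["adoption is rising"], "confidence": 0.8}'

try:
    summary = Summary.model_validate_json(raw)
    print(summary.key_points)
except ValidationError as e:
    print(f"Malformed LLM output: {e}")
```

Validation failures surface immediately as `ValidationError` rather than as silent downstream bugs.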
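The quality gates from best practice 9 can be sketched as a small script that CI runs after the evaluation suite. The threshold, metric, and results format are assumptions; adapt them to your eval output:

```python
# Minimal quality-gate check: average the eval scores and report pass/fail.
# In CI you would call sys.exit(0 or 1) on the result to block the merge.
import json
import tempfile

THRESHOLD = 0.85  # illustrative threshold; tune per project


def check_gate(results_path: str) -> bool:
    """Return True when the average eval score clears the threshold."""
    with open(results_path) as f:
        results = json.load(f)
    scores = [r["score"] for r in results if r.get("score") is not None]
    avg = sum(scores) / len(scores) if scores else 0.0
    print(f"average score: {avg:.3f} (threshold {THRESHOLD})")
    return avg >= THRESHOLD


# Demo with a synthetic results file; in CI, point this at your eval output
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"score": 0.9}, {"score": 0.8}], f)
    demo_path = f.name

passed = check_gate(demo_path)
print("gate passed" if passed else "gate failed")
```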

Troubleshooting

**Ollama model download fails or hangs.** Check available disk space (models range from 4 GB to 40 GB+). Verify network connectivity with `curl http://localhost:11434/api/tags`. Restart the Ollama service with `ollama serve` and retry the pull.

**Local model responses are slow.** Ensure you are using a quantized model appropriate for your hardware. On CPU-only machines, use Q4_K_M quantizations. Check that no other processes are consuming GPU memory with `nvidia-smi`.

**LangChain version conflicts.** Pin your LangChain versions explicitly: `langchain==0.3.x`, `langchain-openai==0.2.x`. Use a virtual environment per project and avoid mixing LangChain v0.2 and v0.3 dependencies.

**Evaluation results are inconsistent across runs.** Set temperature to 0 for evaluation runs. Use fixed random seeds where available. For LLM-as-a-judge evaluations, run multiple judge passes and average the scores.
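The multi-pass judging suggestion can be sketched as follows. `judge_score` is a hypothetical helper, and it assumes the judge replies with free text containing a 1-5 integer score:

```python
import re
import statistics


def judge_score(judge_model, prompt: str, passes: int = 3) -> float:
    """Run the judge several times and average the extracted 1-5 scores."""
    scores = []
    for _ in range(passes):
        reply = str(judge_model.invoke(prompt))
        match = re.search(r"[1-5]", reply)  # first digit in range counts as the score
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores) if scores else 0.0
```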

**Memory issues with large models locally.** Monitor RAM with `htop`. For models larger than available RAM, use smaller quantizations or switch to API-based models. Close other memory-intensive applications during local inference.
