# Comprehensive Research Engineer

A comprehensive skill for rigorous, academically grounded research engineering. Includes structured workflows, validation checks, and reusable patterns for AI research.
Senior research engineering methodology for bridging theoretical computer science and high-performance implementation — emphasizing scientific rigor, reproducible experiments, and production-quality research code.
## When to Use
Apply this methodology when:
- Implementing ML/AI research papers from scratch
- Running reproducible experiments with statistical rigor
- Building research prototypes that need to scale to production
- Reviewing and critiquing research claims with empirical evidence
Use standard engineering practices when:
- Building standard application features
- Tasks without research or experimental components
- Well-established patterns with known solutions
## Quick Start

### Experiment Template
```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class ExperimentConfig:
    name: str
    model: str
    dataset: str
    learning_rate: float
    batch_size: int
    epochs: int
    seed: int = 42

    @property
    def experiment_id(self):
        # Hash the full config so identical settings map to the same ID
        config_str = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(config_str.encode()).hexdigest()[:12]


class Experiment:
    def __init__(self, config: ExperimentConfig):
        self.config = config
        self.output_dir = Path(f"experiments/{config.experiment_id}")
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self._save_config()

    def _save_config(self):
        with open(self.output_dir / "config.json", "w") as f:
            json.dump(asdict(self.config), f, indent=2)

    def run(self):
        self._set_seeds(self.config.seed)
        metrics = self._train()
        self._save_results(metrics)
        return metrics

    def _set_seeds(self, seed):
        import random

        import numpy as np
        import torch

        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)

    def _train(self):
        # Subclass and implement the actual training loop;
        # return a dict of metrics.
        raise NotImplementedError

    def _save_results(self, metrics):
        results = {
            "config": asdict(self.config),
            "metrics": metrics,
            "timestamp": datetime.now().isoformat(),
        }
        with open(self.output_dir / "results.json", "w") as f:
            json.dump(results, f, indent=2)
```
### Paper Implementation Checklist
```markdown
## Paper: {paper_title}

- [ ] Read paper 3 times (overview, details, critique)
- [ ] Identify key claims and expected results
- [ ] List all hyperparameters mentioned
- [ ] Find reference implementation (if exists)
- [ ] Implement core algorithm
- [ ] Reproduce Table 1 / Figure 1 results
- [ ] Run ablation studies
- [ ] Document deviations from paper
- [ ] Statistical significance testing (3+ seeds)
```
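The final checklist item — significance testing across 3+ seeds — can be sketched as a small aggregation helper. This is a minimal illustration using only the standard library; `run_fn` is a hypothetical stand-in for whatever callable launches one training run and returns a scalar metric:

```python
import statistics


def aggregate_seed_runs(run_fn, seeds=(0, 1, 2)):
    """Run `run_fn` once per seed and summarize the resulting scores.

    `run_fn` is a hypothetical callable mapping a seed to a scalar metric.
    """
    scores = [run_fn(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation; 0.0 when only one run exists
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "n": len(scores),
    }
```

Reporting `mean` together with `std` and `n` is the minimum needed for the significance comparison shown in the Statistical Testing section below.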
## Core Concepts

### Scientific Rigor Principles
| Principle | Implementation | Why It Matters |
|---|---|---|
| Reproducibility | Fixed seeds, versioned data, logged configs | Others must replicate your results |
| Statistical validity | Multiple runs, confidence intervals | Single runs are noise |
| Fair comparison | Same compute budget, same data splits | Apples-to-apples only |
| Ablation | Change one variable at a time | Isolate what actually helps |
| Documentation | Log everything, explain decisions | Future you will forget |
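The Reproducibility and Documentation rows above amount to capturing provenance alongside metrics. A minimal sketch using only the standard library (the exact fields you log, and whether a git checkout is present, are assumptions to adapt):

```python
import platform
import subprocess
import sys


def capture_environment():
    """Collect basic provenance details worth logging with every experiment."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # not running inside a git checkout
    return {
        "python": sys.version.split()[0],   # e.g. "3.11.4"
        "platform": platform.platform(),    # OS and architecture string
        "git_commit": commit,
    }
```

In practice you would extend this with library versions and hardware specs, then store it next to `config.json` in the experiment directory.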
### Experiment Management

```
experiments/
├── {experiment_id}/
│   ├── config.json       # Full configuration
│   ├── results.json      # Metrics and outputs
│   ├── logs/             # Training logs
│   ├── checkpoints/      # Model checkpoints
│   └── analysis/         # Post-hoc analysis
├── comparisons/
│   └── {baseline_vs_method}.json
└── paper_results/
    └── {table_or_figure}.json
```
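A small helper, assuming the layout above, that creates the per-experiment subdirectories up front so later writes never fail on a missing path (the function name is illustrative, not part of the template):

```python
from pathlib import Path


def init_experiment_dirs(experiment_id, root="experiments"):
    """Create the per-experiment subdirectories used by the layout above."""
    base = Path(root) / experiment_id
    for sub in ("logs", "checkpoints", "analysis"):
        # parents=True also creates experiments/{experiment_id} itself
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```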
### Statistical Testing
```python
import numpy as np
from scipy import stats


def compare_methods(baseline_scores, method_scores, alpha=0.05):
    """Statistical comparison of two methods."""
    t_stat, p_value = stats.ttest_ind(baseline_scores, method_scores)
    return {
        "baseline_mean": np.mean(baseline_scores),
        "baseline_std": np.std(baseline_scores),
        "method_mean": np.mean(method_scores),
        "method_std": np.std(method_scores),
        "t_statistic": t_stat,
        "p_value": p_value,
        "significant": p_value < alpha,
        # Glass's delta: mean difference scaled by the baseline std
        "effect_size": (np.mean(method_scores) - np.mean(baseline_scores))
        / np.std(baseline_scores),
    }
```
## Configuration
| Parameter | Description |
|---|---|
| `seed` | Random seed for reproducibility |
| `num_runs` | Runs per configuration (minimum 3) |
| `confidence_level` | Significance threshold α for statistical tests (e.g. 0.05) |
| `checkpoint_interval` | Steps between model checkpoint saves |
| `log_interval` | Steps between metric logging |
| `wandb_project` | Weights & Biases project name for experiment tracking |
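A hypothetical `config.json` combining these parameters (all values are illustrative placeholders, not defaults):

```json
{
  "name": "baseline_run",
  "seed": 42,
  "num_runs": 3,
  "confidence_level": 0.05,
  "checkpoint_interval": 1000,
  "log_interval": 100,
  "wandb_project": "my-research-project"
}
```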
## Best Practices
- Run every experiment at least 3 times with different seeds — report mean and standard deviation
- Log everything — configs, git commit, library versions, hardware specs
- Compare fairly — same compute budget, data splits, and preprocessing
- Ablate systematically — change one thing at a time to understand contributions
- Read the paper three times before implementing — overview, then details, then critique
- Version your datasets — model results are meaningless without knowing the exact data
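For reporting mean plus an uncertainty range across seeds, a percentile bootstrap is a simple alternative to normal-theory intervals when you have few runs. A pure-Python sketch (the resample count and percentile indexing are simplified for illustration):

```python
import random
import statistics


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With only 3-5 seeds the interval will be wide, which is exactly the point: it makes the noise in single-run comparisons visible.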
## Common Issues
**Cannot reproduce paper results:** Check for undocumented hyperparameters (warmup, gradient clipping, weight decay). Try the reference implementation if available. Contact the authors — they may have errata or unreported details.

**High variance across runs:** Increase the number of runs. Check for non-deterministic operations (dropout, data shuffling). Use larger evaluation sets.

**Experiment tracking chaos:** Use a structured directory layout. Never overwrite results — create new experiment IDs. Use tools like Weights & Biases or MLflow for tracking.