# Comprehensive Research Engineer

A comprehensive skill for rigorous, academically grounded research engineering. Includes structured workflows, validation checks, and reusable patterns for AI research.
Senior research engineering methodology for bridging theoretical computer science and high-performance implementation — emphasizing scientific rigor, reproducible experiments, and production-quality research code.
## When to Use
Apply this methodology when:
- Implementing ML/AI research papers from scratch
- Running reproducible experiments with statistical rigor
- Building research prototypes that need to scale to production
- Reviewing and critiquing research claims with empirical evidence
Use standard engineering practices when:
- Building standard application features
- Tasks without research or experimental components
- Well-established patterns with known solutions
## Quick Start

### Experiment Template
```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class ExperimentConfig:
    name: str
    model: str
    dataset: str
    learning_rate: float
    batch_size: int
    epochs: int
    seed: int = 42

    @property
    def experiment_id(self):
        # Hash the full config so identical settings map to the same ID
        config_str = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(config_str.encode()).hexdigest()[:12]


class Experiment:
    def __init__(self, config: ExperimentConfig):
        self.config = config
        self.output_dir = Path(f"experiments/{config.experiment_id}")
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self._save_config()

    def _save_config(self):
        with open(self.output_dir / "config.json", "w") as f:
            json.dump(asdict(self.config), f, indent=2)

    def run(self):
        self._set_seeds(self.config.seed)
        metrics = self._train()
        self._save_results(metrics)
        return metrics

    def _set_seeds(self, seed):
        import random

        import numpy as np
        import torch

        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)

    def _train(self):
        # Subclass and implement the actual training loop;
        # return a dict of metrics.
        raise NotImplementedError

    def _save_results(self, metrics):
        results = {
            "config": asdict(self.config),
            "metrics": metrics,
            "timestamp": datetime.now().isoformat(),
        }
        with open(self.output_dir / "results.json", "w") as f:
            json.dump(results, f, indent=2)
```
### Paper Implementation Checklist
```markdown
## Paper: {paper_title}

- [ ] Read paper 3 times (overview, details, critique)
- [ ] Identify key claims and expected results
- [ ] List all hyperparameters mentioned
- [ ] Find reference implementation (if exists)
- [ ] Implement core algorithm
- [ ] Reproduce Table 1 / Figure 1 results
- [ ] Run ablation studies
- [ ] Document deviations from paper
- [ ] Statistical significance testing (3+ seeds)
```
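The final checklist item — significance testing across 3+ seeds — can be sketched as a small aggregation helper. This is a minimal illustration using only the standard library; `run_fn` is a hypothetical stand-in for whatever callable launches one training run and returns a scalar metric:

```python
import statistics


def aggregate_seed_runs(run_fn, seeds=(0, 1, 2)):
    """Run `run_fn` once per seed and summarize the resulting scores.

    `run_fn` is a hypothetical callable mapping a seed to a scalar metric.
    """
    scores = [run_fn(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation; 0.0 when only one run exists
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "n": len(scores),
    }
```

Reporting `mean` together with `std` and `n` is the minimum needed for the significance comparison shown in the Statistical Testing section below.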
## Core Concepts

### Scientific Rigor Principles
| Principle | Implementation | Why It Matters |
|---|---|---|
| Reproducibility | Fixed seeds, versioned data, logged configs | Others must replicate your results |
| Statistical validity | Multiple runs, confidence intervals | Single runs are noise |
| Fair comparison | Same compute budget, same data splits | Apples-to-apples only |
| Ablation | Change one variable at a time | Isolate what actually helps |
| Documentation | Log everything, explain decisions | Future you will forget |
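The Reproducibility and Documentation rows above amount to capturing provenance alongside metrics. A minimal sketch using only the standard library (the exact fields you log, and whether a git checkout is present, are assumptions to adapt):

```python
import platform
import subprocess
import sys


def capture_environment():
    """Collect basic provenance details worth logging with every experiment."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # not running inside a git checkout
    return {
        "python": sys.version.split()[0],   # e.g. "3.11.4"
        "platform": platform.platform(),    # OS and architecture string
        "git_commit": commit,
    }
```

In practice you would extend this with library versions and hardware specs, then store it next to `config.json` in the experiment directory.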
### Experiment Management

```
experiments/
├── {experiment_id}/
│   ├── config.json       # Full configuration
│   ├── results.json      # Metrics and outputs
│   ├── logs/             # Training logs
│   ├── checkpoints/      # Model checkpoints
│   └── analysis/         # Post-hoc analysis
├── comparisons/
│   └── {baseline_vs_method}.json
└── paper_results/
    └── {table_or_figure}.json
```
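A small helper, assuming the layout above, that creates the per-experiment subdirectories up front so later writes never fail on a missing path (the function name is illustrative, not part of the template):

```python
from pathlib import Path


def init_experiment_dirs(experiment_id, root="experiments"):
    """Create the per-experiment subdirectories used by the layout above."""
    base = Path(root) / experiment_id
    for sub in ("logs", "checkpoints", "analysis"):
        # parents=True also creates experiments/{experiment_id} itself
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```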
### Statistical Testing
```python
import numpy as np
from scipy import stats


def compare_methods(baseline_scores, method_scores, alpha=0.05):
    """Statistical comparison of two methods."""
    t_stat, p_value = stats.ttest_ind(baseline_scores, method_scores)
    return {
        "baseline_mean": np.mean(baseline_scores),
        "baseline_std": np.std(baseline_scores),
        "method_mean": np.mean(method_scores),
        "method_std": np.std(method_scores),
        "t_statistic": t_stat,
        "p_value": p_value,
        "significant": p_value < alpha,
        # Glass's delta: mean difference scaled by the baseline std
        "effect_size": (np.mean(method_scores) - np.mean(baseline_scores))
        / np.std(baseline_scores),
    }
```
## Configuration
| Parameter | Description |
|---|---|
| `seed` | Random seed for reproducibility |
| `num_runs` | Runs per configuration (minimum 3) |
| `confidence_level` | Significance threshold α for statistical tests (e.g. 0.05) |
| `checkpoint_interval` | Steps between model checkpoint saves |
| `log_interval` | Steps between metric logging |
| `wandb_project` | Weights & Biases project name for experiment tracking |
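A hypothetical `config.json` combining these parameters (all values are illustrative placeholders, not defaults):

```json
{
  "name": "baseline_run",
  "seed": 42,
  "num_runs": 3,
  "confidence_level": 0.05,
  "checkpoint_interval": 1000,
  "log_interval": 100,
  "wandb_project": "my-research-project"
}
```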
## Best Practices
- Run every experiment at least 3 times with different seeds — report mean and standard deviation
- Log everything — configs, git commit, library versions, hardware specs
- Compare fairly — same compute budget, data splits, and preprocessing
- Ablate systematically — change one thing at a time to understand contributions
- Read the paper three times before implementing — overview, then details, then critique
- Version your datasets — model results are meaningless without knowing the exact data
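For reporting mean plus an uncertainty range across seeds, a percentile bootstrap is a simple alternative to normal-theory intervals when you have few runs. A pure-Python sketch (the resample count and percentile indexing are simplified for illustration):

```python
import random
import statistics


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With only 3-5 seeds the interval will be wide, which is exactly the point: it makes the noise in single-run comparisons visible.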
## Common Issues
**Cannot reproduce paper results:** Check for undocumented hyperparameters (warmup, gradient clipping, weight decay). Try the reference implementation if available. Contact the authors — they may have errata or unreported details.

**High variance across runs:** Increase the number of runs. Check for non-deterministic operations (dropout, data shuffling). Use larger evaluation sets.

**Experiment tracking chaos:** Use a structured directory layout. Never overwrite results — create new experiment IDs. Use tools like Weights & Biases or MLflow for tracking.