
Ultimate LaminDB Framework

Build production-grade biological data management pipelines using LaminDB, a data framework purpose-built for biology. This skill covers data ingestion, artifact tracking, lineage management, and integration with popular bioinformatics tools for managing datasets across experiments and analyses.

When to Use This Skill

Choose Ultimate LaminDB Framework when you need to:

  • Track biological data artifacts with full provenance and lineage
  • Manage AnnData, FASTA, BAM, or other bioinformatics file types with metadata
  • Build reproducible analysis pipelines with automatic data versioning
  • Query and retrieve datasets across multiple experiments using biological ontologies

Consider alternatives when:

  • You need a general-purpose data warehouse (use Snowflake or BigQuery)
  • You need only file storage without biological metadata (use S3 or GCS directly)
  • You need interactive data exploration without pipeline management (use CellxGene)

Quick Start

```bash
# Install LaminDB with biological extras
pip install 'lamindb[bionty]'
```

```python
import lamindb as ln
import bionty as bt
import anndata as ad

# Initialize a new LaminDB instance
ln.setup.init(storage="./my_research_data", schema="bionty")

# Register a dataset
adata = ad.read_h5ad("my_scrnaseq.h5ad")

# Create an artifact with biological metadata
artifact = ln.Artifact.from_anndata(
    adata,
    description="Single-cell RNA-seq of human PBMC",
    key="datasets/pbmc_10k.h5ad",
)

# Annotate with ontology terms
cell_types = bt.CellType.from_values(adata.obs["cell_type"].unique())
artifact.cell_types.set(cell_types)
artifact.save()

print(f"Saved artifact: {artifact.uid}")
```

Core Concepts

Data Model

| Component  | Purpose                                  | Example                         |
| ---------- | ---------------------------------------- | ------------------------------- |
| Artifact   | Any data object (file, array, DataFrame) | H5AD file, CSV, FASTA           |
| Collection | Grouped artifacts with shared context    | All samples from one experiment |
| Transform  | Code that creates or modifies artifacts  | Jupyter notebook, Python script |
| Run        | Single execution of a transform          | Analysis run on 2024-01-15      |
| Feature    | Measured variable or annotation column   | Gene name, cell type label      |
| ULabel     | Universal label for categorization       | Tissue type, disease state      |
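How these components relate can be sketched with a minimal in-memory model. This is a hypothetical illustration of the relationships in the table above, not the LaminDB API; the class and field names are invented for clarity:

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    """Code that creates or modifies artifacts (e.g. a script or notebook)."""
    name: str

@dataclass
class Artifact:
    """A data object such as an H5AD file or CSV, keyed by a path-like name."""
    key: str
    run: "Run | None" = None  # the run that produced this artifact, if any

@dataclass
class Run:
    """A single execution of a transform, with its input artifacts."""
    transform: Transform
    inputs: list[Artifact] = field(default_factory=list)

# One run of an analysis script turns a raw artifact into a processed one
script = Transform(name="my_analysis_v2")
raw = Artifact(key="datasets/pbmc_10k.h5ad")
run = Run(transform=script, inputs=[raw])
processed = Artifact(key="datasets/pbmc_10k_processed.h5ad", run=run)

# Lineage query: walk from an output back to the code and inputs behind it
assert processed.run.transform.name == "my_analysis_v2"
assert processed.run.inputs[0].key == "datasets/pbmc_10k.h5ad"
```

This output-to-run-to-transform chain is exactly the traversal the lineage queries later in this document perform on real records.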

Lineage Tracking

```python
import lamindb as ln
import scanpy as sc

# Track a transform (analysis script)
ln.track("my_analysis_v2")

# Load input artifacts
raw = ln.Artifact.filter(description__contains="raw counts").one()
adata = raw.load()

# Perform analysis
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.tl.umap(adata)

# Save output — lineage automatically links input → transform → output
processed = ln.Artifact.from_anndata(
    adata,
    description="Normalized and processed PBMC data",
)
processed.save()

# Query lineage
print(processed.run)                        # Which run created this?
print(processed.run.input_artifacts.all())  # What went in?
print(processed.transform)                  # What code was used?
```

Querying with Biological Ontologies

```python
import lamindb as ln
import bionty as bt

# Find all artifacts annotated with specific cell types
t_cells = bt.CellType.filter(name__contains="T cell").all()
artifacts = ln.Artifact.filter(cell_types__in=t_cells).all()

# Search by tissue
brain = bt.Tissue.filter(name="brain").one()
brain_data = ln.Artifact.filter(tissues=brain).all()

# Combine multiple filters
results = ln.Artifact.filter(
    cell_types__name__contains="neuron",
    organisms__name="human",
    created_at__gte="2024-01-01",
).all()

for r in results:
    print(f"{r.key}: {r.description} ({r.size} bytes)")
```

Configuration

| Parameter    | Description                            | Default            |
| ------------ | -------------------------------------- | ------------------ |
| storage      | Root storage path or cloud URI         | Required           |
| schema       | Schema modules to load                 | "bionty"           |
| name         | Instance name                          | Directory name     |
| db           | Database backend (SQLite or Postgres)  | "sqlite"           |
| cache_dir    | Local cache for cloud artifacts        | "~/.cache/lamindb" |
| auto_connect | Connect to instance on import          | true               |
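Putting the main parameters together, a production-style initialization might look like the sketch below. It assumes the table's storage, schema, name, and db parameters map directly to keyword arguments of ln.setup.init; the bucket name, instance name, and connection string are placeholders, not working values:

```python
import lamindb as ln

# Hypothetical production setup: cloud storage plus a Postgres backend.
ln.setup.init(
    storage="s3://my-bucket/lamin",       # root storage path or cloud URI
    schema="bionty",                      # schema modules to load
    name="prod-instance",                 # instance name
    db="postgresql://user:pass@host/db",  # database backend
)
```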

Best Practices

  1. Use descriptive artifact keys — Organize artifacts with meaningful path-like keys (projects/pbmc/raw/sample_001.h5ad) rather than generic names. This makes browsing the data store intuitive and enables prefix-based queries.

  2. Annotate with ontology terms early — Attach cell type, tissue, organism, and disease annotations when you first register an artifact. Retroactively annotating hundreds of datasets is tedious and error-prone.

  3. Track every analysis step — Call ln.track() at the start of every notebook or script. This automatically records lineage so you can trace any result back to its raw data and code, which is critical for reproducibility.

  4. Version artifacts instead of overwriting — When re-processing data, save as a new artifact version rather than overwriting. LaminDB's versioning lets you compare results across processing iterations and roll back if needed.

  5. Use Collections for experiment groups — Group related artifacts into Collections (e.g., all samples from one sequencing run) to simplify batch queries and downstream analysis pipelines.
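Practice 1 pays off because a path-like key prefix selects a coherent slice of the data store in one query. The pure-Python sketch below illustrates the idea; the keys are invented examples, and the filter function stands in for LaminDB's key-based artifact queries:

```python
# Invented example keys following the path-like naming convention
keys = [
    "projects/pbmc/raw/sample_001.h5ad",
    "projects/pbmc/raw/sample_002.h5ad",
    "projects/pbmc/processed/sample_001.h5ad",
    "projects/heart/raw/sample_001.h5ad",
]

def by_prefix(keys: list[str], prefix: str) -> list[str]:
    """Stand-in for a prefix-based artifact query over keys."""
    return [k for k in keys if k.startswith(prefix)]

# One prefix selects all raw PBMC samples, and nothing else
raw_pbmc = by_prefix(keys, "projects/pbmc/raw/")
assert raw_pbmc == [
    "projects/pbmc/raw/sample_001.h5ad",
    "projects/pbmc/raw/sample_002.h5ad",
]
```

With generic names like final_v2.h5ad, no such prefix exists and every query degenerates into matching free-text descriptions.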

Common Issues

Schema validation errors on save — When saving artifacts with biological annotations, all ontology terms must be registered in the current instance. Run bt.CellType.from_values(terms) to auto-register missing terms before calling artifact.save(), or use ln.save(terms) to bulk-register.

Storage permission errors with cloud backends — When using S3 or GCS storage, ensure your credentials have both read and write access to the bucket. LaminDB writes metadata to the database and files to storage simultaneously — partial permissions cause silent failures where metadata exists but files are missing.

Slow queries on large instances — SQLite backends slow down with more than 100,000 artifacts. Switch to PostgreSQL for production instances using ln.setup.init(storage="s3://bucket", db="postgresql://user:pass@host/db") and add indexes on frequently queried fields.
