
Ultimate LaminDB Framework

Build production-grade biological data management pipelines using LaminDB, a data framework purpose-built for biology. This skill covers data ingestion, artifact tracking, lineage management, and integration with popular bioinformatics tools for managing datasets across experiments and analyses.

When to Use This Skill

Choose Ultimate LaminDB Framework when you need to:

  • Track biological data artifacts with full provenance and lineage
  • Manage AnnData, FASTA, BAM, or other bioinformatics file types with metadata
  • Build reproducible analysis pipelines with automatic data versioning
  • Query and retrieve datasets across multiple experiments using biological ontologies

Consider alternatives when:

  • You need a general-purpose data warehouse (use Snowflake or BigQuery)
  • You need only file storage without biological metadata (use S3 or GCS directly)
  • You need interactive data exploration without pipeline management (use CellxGene)

Quick Start

```bash
# Install LaminDB with biological extras
pip install 'lamindb[bionty]'
```

```python
import lamindb as ln
import bionty as bt
import anndata as ad

# Initialize a new LaminDB instance
ln.setup.init(storage="./my_research_data", schema="bionty")

# Register a dataset
adata = ad.read_h5ad("my_scrnaseq.h5ad")

# Create an artifact with biological metadata
artifact = ln.Artifact.from_anndata(
    adata,
    description="Single-cell RNA-seq of human PBMC",
    key="datasets/pbmc_10k.h5ad",
)

# Annotate with ontology terms
cell_types = bt.CellType.from_values(adata.obs["cell_type"].unique())
artifact.cell_types.set(cell_types)
artifact.save()

print(f"Saved artifact: {artifact.uid}")
```

Core Concepts

Data Model

| Component  | Purpose                                  | Example                         |
| ---------- | ---------------------------------------- | ------------------------------- |
| Artifact   | Any data object (file, array, DataFrame) | H5AD file, CSV, FASTA           |
| Collection | Grouped artifacts with shared context    | All samples from one experiment |
| Transform  | Code that creates or modifies artifacts  | Jupyter notebook, Python script |
| Run        | Single execution of a transform          | Analysis run on 2024-01-15      |
| Feature    | Measured variable or annotation column   | Gene name, cell type label      |
| ULabel     | Universal label for categorization       | Tissue type, disease state      |
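How these components relate can be sketched with a minimal in-memory model. This is a hypothetical illustration of the relationships in the table above, not the LaminDB API; the class and field names are invented for clarity:

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    """Code that creates or modifies artifacts (e.g. a script or notebook)."""
    name: str

@dataclass
class Artifact:
    """A data object such as an H5AD file or CSV, keyed by a path-like name."""
    key: str
    run: "Run | None" = None  # the run that produced this artifact, if any

@dataclass
class Run:
    """A single execution of a transform, with its input artifacts."""
    transform: Transform
    inputs: list[Artifact] = field(default_factory=list)

# One run of an analysis script turns a raw artifact into a processed one
script = Transform(name="my_analysis_v2")
raw = Artifact(key="datasets/pbmc_10k.h5ad")
run = Run(transform=script, inputs=[raw])
processed = Artifact(key="datasets/pbmc_10k_processed.h5ad", run=run)

# Lineage query: walk from an output back to the code and inputs behind it
assert processed.run.transform.name == "my_analysis_v2"
assert processed.run.inputs[0].key == "datasets/pbmc_10k.h5ad"
```

This output-to-run-to-transform chain is exactly the traversal the lineage queries later in this document perform on real records.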

Lineage Tracking

```python
import lamindb as ln
import scanpy as sc

# Track a transform (analysis script)
ln.track("my_analysis_v2")

# Load input artifacts
raw = ln.Artifact.filter(description__contains="raw counts").one()
adata = raw.load()

# Perform analysis
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.tl.umap(adata)

# Save output — lineage automatically links input → transform → output
processed = ln.Artifact.from_anndata(
    adata,
    description="Normalized and processed PBMC data",
)
processed.save()

# Query lineage
print(processed.run)                        # Which run created this?
print(processed.run.input_artifacts.all())  # What went in?
print(processed.transform)                  # What code was used?
```

Querying with Biological Ontologies

```python
import lamindb as ln
import bionty as bt

# Find all artifacts annotated with specific cell types
t_cells = bt.CellType.filter(name__contains="T cell").all()
artifacts = ln.Artifact.filter(cell_types__in=t_cells).all()

# Search by tissue
brain = bt.Tissue.filter(name="brain").one()
brain_data = ln.Artifact.filter(tissues=brain).all()

# Combine multiple filters
results = ln.Artifact.filter(
    cell_types__name__contains="neuron",
    organisms__name="human",
    created_at__gte="2024-01-01",
).all()

for r in results:
    print(f"{r.key}: {r.description} ({r.size} bytes)")
```

Configuration

| Parameter    | Description                            | Default            |
| ------------ | -------------------------------------- | ------------------ |
| storage      | Root storage path or cloud URI         | Required           |
| schema       | Schema modules to load                 | "bionty"           |
| name         | Instance name                          | Directory name     |
| db           | Database backend (SQLite or Postgres)  | "sqlite"           |
| cache_dir    | Local cache for cloud artifacts        | "~/.cache/lamindb" |
| auto_connect | Connect to instance on import          | true               |
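Putting the main parameters together, a production-style initialization might look like the sketch below. It assumes the table's storage, schema, name, and db parameters map directly to keyword arguments of ln.setup.init; the bucket name, instance name, and connection string are placeholders, not working values:

```python
import lamindb as ln

# Hypothetical production setup: cloud storage plus a Postgres backend.
ln.setup.init(
    storage="s3://my-bucket/lamin",       # root storage path or cloud URI
    schema="bionty",                      # schema modules to load
    name="prod-instance",                 # instance name
    db="postgresql://user:pass@host/db",  # database backend
)
```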

Best Practices

  1. Use descriptive artifact keys — Organize artifacts with meaningful path-like keys (projects/pbmc/raw/sample_001.h5ad) rather than generic names. This makes browsing the data store intuitive and enables prefix-based queries.

  2. Annotate with ontology terms early — Attach cell type, tissue, organism, and disease annotations when you first register an artifact. Retroactively annotating hundreds of datasets is tedious and error-prone.

  3. Track every analysis step — Call ln.track() at the start of every notebook or script. This automatically records lineage so you can trace any result back to its raw data and code, which is critical for reproducibility.

  4. Version artifacts instead of overwriting — When re-processing data, save as a new artifact version rather than overwriting. LaminDB's versioning lets you compare results across processing iterations and roll back if needed.

  5. Use Collections for experiment groups — Group related artifacts into Collections (e.g., all samples from one sequencing run) to simplify batch queries and downstream analysis pipelines.
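Practice 1 pays off because a path-like key prefix selects a coherent slice of the data store in one query. The pure-Python sketch below illustrates the idea; the keys are invented examples, and the filter function stands in for LaminDB's key-based artifact queries:

```python
# Invented example keys following the path-like naming convention
keys = [
    "projects/pbmc/raw/sample_001.h5ad",
    "projects/pbmc/raw/sample_002.h5ad",
    "projects/pbmc/processed/sample_001.h5ad",
    "projects/heart/raw/sample_001.h5ad",
]

def by_prefix(keys: list[str], prefix: str) -> list[str]:
    """Stand-in for a prefix-based artifact query over keys."""
    return [k for k in keys if k.startswith(prefix)]

# One prefix selects all raw PBMC samples, and nothing else
raw_pbmc = by_prefix(keys, "projects/pbmc/raw/")
assert raw_pbmc == [
    "projects/pbmc/raw/sample_001.h5ad",
    "projects/pbmc/raw/sample_002.h5ad",
]
```

With generic names like final_v2.h5ad, no such prefix exists and every query degenerates into matching free-text descriptions.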

Common Issues

Schema validation errors on save — When saving artifacts with biological annotations, all ontology terms must be registered in the current instance. Run bt.CellType.from_values(terms) to auto-register missing terms before calling artifact.save(), or use ln.save(terms) to bulk-register.

Storage permission errors with cloud backends — When using S3 or GCS storage, ensure your credentials have both read and write access to the bucket. LaminDB writes metadata to the database and files to storage simultaneously — partial permissions cause silent failures where metadata exists but files are missing.

Slow queries on large instances — SQLite backends slow down with more than 100,000 artifacts. Switch to PostgreSQL for production instances using ln.setup.init(storage="s3://bucket", db="postgresql://user:pass@host/db") and add indexes on frequently queried fields.
