Pipette for plant bioinformatics and agricultural genomics: GBLUP, QTL mapping, plant GWAS, and the rest of the breeders' toolkit
Variome Analytics
May 10, 2026
We have spent most of Pipette's life serving human and biomedical genomics. The platform's core, an LLM agent that writes and executes analysis code on demand, has always been domain-agnostic. The reason it looked biomedical was simply because that is where the early users came from. Over the last few weeks we have closed the gap on the other half of applied genomics: animal and plant breeding. This post is a tour of what landed.
If you work on cattle, dairy, swine, poultry, sheep, maize, wheat, rice, soybean, sorghum, or any other production species where you are estimating breeding values, mapping QTLs, or running marker-trait associations on diverse germplasm, Pipette now does this end to end. You bring genotype and phenotype files. You ask questions in plain English. The agent loads the right skill, picks the right method, runs it, hands you GEBVs or LOD curves or PCs with a written interpretation.
Genomic prediction is the centerpiece
We have shipped a genomic-prediction skill backed by three of the most widely used R packages in the field.
rrBLUP for ridge-regression GBLUP. The textbook approach. Fast on big marker matrices, single-line API for fitting and prediction, the default choice when the trait is highly polygenic and you want one accurate number per individual.
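For intuition: rrBLUP's closed-form fit amounts to ridge regression on the marker matrix, solved in the individual dimension so that millions of markers stay cheap. A minimal numpy sketch of that idea (our own function, not rrBLUP's actual API):

```python
import numpy as np

def rrblup_sketch(Z, y, lam=1.0):
    """Ridge-regression BLUP of marker effects (illustrative).

    Z   : (n_individuals, n_markers) centered genotype matrix
    y   : (n_individuals,) centered phenotypes
    lam : ridge parameter, sigma_e^2 / sigma_u^2
    """
    n = Z.shape[0]
    # Solve in the n x n "animal" dimension; cheap when markers >> individuals
    K = Z @ Z.T + lam * np.eye(n)
    alpha = np.linalg.solve(K, y)
    u = Z.T @ alpha          # per-marker effects
    gebv = Z @ u             # genomic estimated breeding values
    return u, gebv
```

In practice rrBLUP also estimates the variance components by REML rather than taking `lam` as given; the algebra above is the core of the prediction step.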
BGLR for the full Bayesian regression family. BayesA, BayesB, BayesC, BayesR, Bayesian LASSO, Bayesian Ridge. Useful when you suspect a small number of large-effect QTLs, when you want posterior intervals on marker effects, or when you want to compare priors against rrBLUP on the same cohort. The skill exposes all of them.
GCTA for narrow-sense heritability via REML, and for conditional and joint analysis (COJO) on GWAS summary statistics. Even when h2 is not the headline result, computing it before fitting a prediction model tells you the upper bound on the accuracy you can expect from a marker panel. The skill runs GCTA REML as an optional preliminary step.
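That upper bound is easy to see with the Daetwyler et al. approximation for expected prediction accuracy. A back-of-envelope sketch, not the skill's code; the effective number of independent chromosome segments `me` is an assumption you would set per species:

```python
import math

def expected_accuracy(n, h2, me):
    """Daetwyler et al.-style expected accuracy of genomic prediction.

    n  : training population size
    h2 : narrow-sense heritability of the trait
    me : effective number of independent chromosome segments
    """
    return math.sqrt(n * h2 / (n * h2 + me))

# e.g. 800 animals, h2 = 0.30, Me = 1000 segments -> roughly 0.44
```

Doubling h2 or the training set moves the bound far more than any choice of prior, which is why the skill surfaces h2 first.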
The skill takes a genotype matrix or a VCF, a phenotype file, and an optional pedigree. It mean-imputes missing genotypes, MAF-filters, fits the chosen model, runs five-fold cross-validation, reports per-individual GEBVs with the cross-validated correlation as the only honest accuracy metric, and lists the top markers by absolute effect. It explicitly refuses to report selection or culling recommendations. That is the breeder's call, not ours.
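The preprocessing and cross-validation loop can be sketched in a few lines of numpy. Illustrative only: the function names are ours, and the real skill delegates the model fit to rrBLUP rather than this inline ridge solve:

```python
import numpy as np

def mean_impute_and_maf_filter(G, maf_min=0.05):
    """G: (n, m) genotype matrix coded 0/1/2 with np.nan for missing."""
    col_means = np.nanmean(G, axis=0)
    G = np.where(np.isnan(G), col_means, G)
    maf = np.minimum(col_means / 2, 1 - col_means / 2)  # allele freq from mean dosage
    return G[:, maf >= maf_min]

def cv_accuracy(G, y, k=5, lam=1.0, seed=0):
    """k-fold cross-validated correlation between GEBV and phenotype."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    preds = np.empty(n)
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Zt = G[train] - G[train].mean(axis=0)
        yt = y[train] - y[train].mean()
        # Ridge solve in the individual dimension (GBLUP form)
        alpha = np.linalg.solve(Zt @ Zt.T + lam * np.eye(len(train)), yt)
        u = Zt.T @ alpha
        preds[fold] = (G[fold] - G[train].mean(axis=0)) @ u + y[train].mean()
    return np.corrcoef(preds, y)[0, 1]
```

The key property is that every prediction comes from a model that never saw that individual, which is why the cross-validated correlation is the accuracy number the skill reports.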
What we deliberately skipped in the first release: sommer, for multi-environment GxE plant-breeding mixed models, is on CRAN but not on bioconda, so the skill installs it from CRAN at runtime if the user asks. BLUPF90 is not on bioconda either and needs a source build, so single-step ssGBLUP is not in the default image yet. Both are on the roadmap.
QTL mapping for biparental crosses
R/qtl is in the image. The qtl-mapping skill targets the structured population side of plant and animal breeding: F2, backcross, recombinant inbred lines, doubled haploid populations. Interval mapping, composite interval mapping, multi-QTL search via stepwiseqtl, empirical LOD thresholds via permutation. The skill emits per-trait LOD scans, peak tables with 1.5-LOD intervals, and a multi-QTL fit summarising variance explained per peak.
For the more recent MAGIC and advanced-intercross populations, the skill notes that R/qtl2 is the better tool and installs it from CRAN at runtime when relevant.
The skill is explicit about what biparental QTL mapping cannot do: a single generation of crossing gives you only a handful of recombination events between nearby markers, so your 1.5-LOD intervals will be wide. Fine-mapping to a candidate gene needs more recombinants or a panel-based approach. We do not pretend otherwise in the output.
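For intuition on the LOD scores those scans report: a LOD is a likelihood-ratio statistic on a log10 scale. R/qtl's interval mapping works on genotype probabilities between markers, but a minimal single-marker version (our own sketch, not R/qtl code) shows the shape of the computation:

```python
import numpy as np

def single_marker_lod(geno, pheno):
    """LOD score for one marker: (n/2) * log10(RSS_null / RSS_marker).

    geno  : (n,) genotype codes, e.g. 0/1/2 for A/H/B in an F2
    pheno : (n,) trait values
    """
    n = len(pheno)
    # Residual sum of squares under the null (intercept-only) model
    rss0 = np.sum((pheno - pheno.mean()) ** 2)
    # Residual sum of squares with the marker as a covariate
    X = np.column_stack([np.ones(n), geno])
    beta, *_ = np.linalg.lstsq(X, pheno, rcond=None)
    rss1 = np.sum((pheno - X @ beta) ** 2)
    return (n / 2) * np.log10(rss0 / rss1)
```

The permutation thresholds the skill computes come from repeating a genome-wide scan like this on shuffled phenotypes and taking a high quantile of the maximum LOD.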
Plant-breeding GWAS with TASSEL
tassel is in the image. The plant-gwas-tassel skill runs MLM, the mixed linear model with kinship and population-structure covariates, and GLM, the general linear model with PC covariates. This is the standard pipeline in NAM, MAGIC, and diverse germplasm panels across maize, wheat, rice, soybean, sorghum, and other field crops.
The skill takes a VCF or HapMap genotype file plus a phenotype matrix in TASSEL's expected format. It computes a centered IBS kinship matrix and the first five PCs inside TASSEL, fits the MLM, and outputs significant marker associations at user-chosen thresholds with effect sizes and allele frequencies. It separately reports the genomic inflation factor and warns when residual confounding suggests adding more PCs or a different kinship method.
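The genomic inflation factor is a one-liner once you have per-marker test statistics: the median observed chi-square divided by the median of a 1-df chi-square under the null (about 0.455). A sketch, assuming you have association z-scores in hand:

```python
import numpy as np

def genomic_inflation(z_scores):
    """Genomic inflation factor lambda_GC from per-marker z-scores.

    lambda = median(z^2) / median of chi-square(1 df) ~= 0.4549.
    Values well above ~1.05 suggest residual structure: add PCs or
    revisit the kinship matrix.
    """
    return np.median(np.asarray(z_scores) ** 2) / 0.4549364
```

Using the median rather than the mean keeps a handful of true large-effect hits from inflating the estimate.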
TASSEL is Java-based and shares the worker's pinned openjdk=17 with GATK and Beagle. We pinned that runtime explicitly because three popular Java tools sharing one JDK is a known source of conda solver conflicts and breakage.
Imputation for low-density chips
Cattle and dairy pipelines commonly impute low-density genotyping chips up to high density before scoring. The new imputation skill wraps Beagle 5 for phasing plus reference-panel imputation in a single pass. The skill stages the reference panel from S3 (we serve 1000 Genomes Phase 3 for human cohorts and can stage species-specific panels on request), runs Beagle per-chromosome, filters the output by Beagle's DR2 info score with sensible default thresholds, and concatenates the chromosomes back into a cohort-level imputed VCF.
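The DR2 filtering step is plain VCF INFO-field arithmetic. A hypothetical sketch of the idea (the real skill shells out to standard VCF tooling rather than parsing records in Python):

```python
def filter_by_dr2(vcf_lines, min_dr2=0.3):
    """Keep VCF records whose Beagle DR2 (dosage r^2) passes a threshold.

    vcf_lines : iterable of VCF text lines (headers start with '#')
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line           # pass headers through untouched
            continue
        info = line.split("\t")[7]
        # Parse key=value INFO entries; skip flag-style entries
        fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if float(fields.get("DR2", 0.0)) >= min_dr2:
            yield line
```

The default threshold trades a few poorly imputed markers for a cleaner downstream GWAS or prediction fit; the skill lets you tighten it.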
The same skill works on low-coverage WGS and on human cohorts that need pre-GWAS imputation. The imputation method does not care about species. The reference panel does.
Population structure and kinship
The population-structure skill uses Bioconductor's SNPRelate to compute PCA, IBD relatedness, and a genetic relationship matrix from a cohort's genotypes. This is the matrix you want to pass into a mixed model as the covariance structure, and these are the PCs you want as GWAS covariates. The skill LD-prunes the SNP set before fitting, filters by MAF, and reports the variance explained by each PC plus the top related pairs by IBD kinship.
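For intuition, the relationship matrix and PCs the skill produces can be sketched with the VanRaden (2008) GRM and an eigendecomposition. SNPRelate's options differ in detail (IBS vs GRM variants, LD pruning beforehand); this numpy version is illustrative only:

```python
import numpy as np

def grm_and_pcs(G, n_pcs=5):
    """VanRaden-style genomic relationship matrix and top PCs (sketch).

    G : (n, m) genotype dosages coded 0/1/2, no missing values
    """
    p = G.mean(axis=0) / 2                       # allele frequencies
    Z = G - 2 * p                                # center each marker by 2p
    grm = Z @ Z.T / (2 * np.sum(p * (1 - p)))    # VanRaden (2008) method 1
    vals, vecs = np.linalg.eigh(grm)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    top = order[:n_pcs]
    pcs = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))
    var_explained = vals[top] / vals.sum()
    return grm, pcs, var_explained
```

The same matrix serves double duty: as the covariance structure in a mixed model and, via its eigenvectors, as the structure covariates in a GWAS.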
A neighbouring skill, sample-swap-check, uses Somalier on BAM/CRAM/VCF to surface unexpected relatedness against an expected pedigree. We extended that skill to also run VerifyBamID2 for cross-sample contamination estimation, since detecting a FREEMIX above two percent matters as much for production-tier livestock pipelines as it does for clinical sequencing.
How the agent strings these together
The skills are not a flat menu. Pipette's agent picks one or chains several based on what the user asked. A typical breeder-facing conversation might go:
"I have 800 Holstein dairy cattle with 50K SNPs and milk-yield phenotypes. Compute genomic estimated breeding values and tell me the prediction accuracy."
The agent loads genomic-prediction, runs GCTA REML for h2 first (the user will want context for the cross-validation accuracy it is about to report), fits rrBLUP on a mean-imputed genotype matrix, runs five-fold CV with a leave-family-out option if the pedigree is provided, and returns GEBVs, the CV correlation, and the heritability estimate.
"Same panel. Are there any large-effect QTLs worth following up?"
The agent switches to BGLR with BayesB, refits the same data, returns the top markers by posterior inclusion probability, and notes that this is a screening result, not a fine-mapping one. If the user wants a credible set, the agent suggests the fine-mapping skill (SuSiE on the BGLR sumstats) as the next step.
"Now imagine I have an F2 cross of two contrasting inbreds for the same trait. Where are the peaks?"
The agent loads qtl-mapping, expects an R/qtl CSV with chromosome, position in cM, and per-individual genotypes coded as A/H/B, runs scanone with the Haley-Knott (HK) method and 1000-permutation thresholds, and follows with stepwiseqtl for the multi-QTL fit.
None of these workflows require the user to know which package is being called. They do require the user to bring data in a coherent format, with sample IDs that match across genotype and phenotype files. The skills are blunt about format expectations and refuse to run if IDs do not align.
What we still owe the breeding community
A few honest gaps.
Multi-environment GxE for plant breeding is best served by sommer (CRAN) or by lme4 for simpler cases. Today the genomic-prediction skill installs sommer at runtime if the user explicitly asks for multi-environment fits. We will bake it into the image properly once we see enough demand.
ssGBLUP (single-step GBLUP with pedigree and genotypes combined) is the gold standard for production-scale livestock evaluation. The BLUPF90 family from the Misztal lab does this best. It is not on bioconda and requires building from source. We will get to it but it is not next in the queue.
Phenotype harmonisation across multi-year, multi-environment trials is a real workflow we have not abstracted into a skill. Researchers do it case-by-case in R. The agent can write that code on the fly, but a dedicated skill would help when the cohort spans many environments and seasons.
Polyploid QTL mapping for potato, sugarcane, and similar species needs mappoly (CRAN). The agent can install it at runtime. We have not built a polyploid-specific skill yet.
Selection-index optimisation combining multiple trait GEBVs with economic weights is something breeders ask for and is not yet a skill. For now it is a few lines of R layered on top of the existing prediction skill's output.
The compute side, briefly
Behind the skills is the rest of Pipette: a queue-aware compute platform that escalates heavy jobs automatically. A 200K-animal pedigree fits on the default worker. A 500K-animal cohort, or a Bayesian fit with millions of markers and many MCMC iterations, will trip the default queue's memory ceiling, and the platform routes the retry to a larger instance with 128 GB of RAM, all without the user needing to know what rna_large means.
Billing-wise, you pay only for compute that produced something. OOMs, errors, and timeouts do not consume credits. That has always been true, and it matters more for breeders running Bayesian fits, where convergence is not guaranteed and you can burn a few hours discovering that.
How to try it
Pipette runs at pipette.bio. Sign up, get monthly free credits, and try one of:
- "Compute GEBVs for milk yield from this genotype matrix and phenotype TSV. Use rrBLUP and report five-fold CV accuracy."
- "Map QTLs for plant height in this F2 cross. Use R/qtl with 1000 permutations to set the LOD threshold."
- "Run a TASSEL MLM GWAS on this maize NAM panel. Kinship from centered IBS, three PCs as covariates."
- "Impute this 50K cattle genotype set up to a high-density reference panel."
If you have a specific tool we missed, the feedback button in the sidebar lands directly with the team. We read every message and reply within a day.
The promise we make to the breeding community is not that Pipette is faster than your hand-rolled R scripts. For routine work on data you already know how to handle, your scripts are fine. The promise is that exploratory work, the kind where you are unsure which method is right, where the format of the new collaborator's data does not match what you usually expect, or where you just want a second opinion on what to try next, gets cheap and fast. The agent reads the data, picks a reasonable method, runs it, shows you the result, and explains the choice. You can argue with it. You can ask for a different method. The conversation stays open.
That is the workflow we have been refining for the last year. The agricultural genomics skills are the newest piece. If they save you the time it would have taken to remember which BGLR prior fit your last cohort best, we have done our job.