NLM DIR Seminar Schedule

The NLM DIR holds a public weekly seminar series for NLM trainees, staff scientists, and investigators to share details on current and exciting research projects at NLM. Seminars take place on Tuesdays at 11:00 AM EST and on some Thursdays at 3:00 PM EST. Seminars are held in the B2 Library of Building 38A on the main NIH campus in Bethesda, MD.

To schedule a seminar, click the “Schedule Seminar” button to the right, select an appropriate date on the calendar to sign up, and then complete the form. You will need an NIH PIV card to access the “Schedule Seminar” page.

Please include seminars by invited visiting scientists in the NLM DIR seminar series. These need not be on a Tuesday or Thursday.

If you would like to schedule a seminar by a visiting scientist, click the “Schedule Seminar” button and complete the form. Contact NLMDIRSeminarScheduling@mail.nih.gov with questions. Please follow this link to subscribe to or unsubscribe from the NLM DIR seminar mailing list.

Titles and Abstracts for Upcoming Seminars


(based on the current date)

John Bridgers
May 12, 2026 at 11 a.m.

A bi-partition function algorithm to evaluate inferred subclonal structures in single-cell sequencing data

Clonal evolution of cancer results in intratumor heterogeneity, making treatment and cure challenging. Single-cell sequencing has advanced our understanding of intratumor heterogeneity, but tracing subclonal evolution using mutational profiles of cells is limited by scale and noise. Moreover, available tumor progression tree inference methods usually offer a single tree to explain the progression of a tumor, and do not inform about alternative evolutionary scenarios. We introduce the bi-partition function for a tumor progression tree to assess the reliability of any proposed subclonal structure in a single-cell sequenced tumor. By using the bi-partition function, we calculate the probability that any given subset R of mutation-profiled single cells from a tumor forms a clade rooted by a specified mutation ρ across all possible tumor progression trees. This provides the means to evaluate whether R forms a subclone with ρ as a possible subclonal driver, which is especially useful if the cells of R are biologically or clinically significant, e.g., have aggressive growth, therapy resistance, or metastatic potential. We also introduce an algorithm to estimate the bi-partition function, which treats the ground truth as a probability distribution derived from mutational profiles of single cells and samples a tumor progression tree from this distribution independently in each iteration. We prove that our algorithm’s estimate of the bi-partition function asymptotically approaches the ground truth and demonstrate its accuracy on simulated data.
Applying our algorithm to the tumor progression tree inferred from single-cell-derived melanoma sublines revealed that, while major clades and their root mutations are robust, (i) the placement of one clade in the tree is unreliable, which we later observed to be a result of loss of heterozygosity, and (ii) some of the mutations identified as false positives in the tree are unreliable, which later turned out to be the result of a doublet (a subline contaminated by another subline). Interestingly, bootstrapping, a technique commonly employed for species trees, failed to point out any of these issues. After correcting the input data for these issues, the reliability of the progression tree improved substantially, demonstrating how our bi-partition function algorithm can aid studies on tumor evolution and intratumor heterogeneity.
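
For readers unfamiliar with the idea, the quantity described in this abstract (the probability that a cell subset R forms a clade rooted by a mutation ρ) can be illustrated with a plain Monte Carlo estimate over independently sampled trees. The sketch below is not the speaker's algorithm. The tree representation, the function `estimate_clade_probability`, and the toy sampler are all hypothetical, and a real sampler would draw trees from a distribution derived from the mutational profiles.

```python
import random
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical representation: a sampled tree maps each mutation to the set
# of cells in the subtree (clade) rooted by that mutation.
Tree = Dict[str, FrozenSet[str]]


def estimate_clade_probability(
    sample_tree: Callable[[], Tree],
    cells_r: Iterable[str],
    rho: str,
    n_samples: int = 1000,
) -> float:
    """Fraction of sampled trees in which the clade rooted by `rho` is exactly R."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    target = frozenset(cells_r)
    hits = 0
    for _ in range(n_samples):
        tree = sample_tree()  # one independent draw per iteration
        if tree.get(rho) == target:
            hits += 1
    return hits / n_samples


# Two toy trees over cells c1..c3 (illustrative data, not from the study).
TREE_A: Tree = {"m1": frozenset({"c1", "c2", "c3"}), "m2": frozenset({"c1", "c2"})}
TREE_B: Tree = {"m1": frozenset({"c1", "c2", "c3"}), "m2": frozenset({"c2", "c3"})}


def make_toy_sampler(seed: int = 0, p_a: float = 0.75) -> Callable[[], Tree]:
    """Return a sampler that yields TREE_A with probability p_a, else TREE_B."""
    rng = random.Random(seed)

    def sample() -> Tree:
        return TREE_A if rng.random() < p_a else TREE_B

    return sample


if __name__ == "__main__":
    estimate = estimate_clade_probability(
        make_toy_sampler(), {"c1", "c2"}, "m2", n_samples=10_000
    )
    print(f"Estimated probability that R is the clade under m2: {estimate:.3f}")
```

In this toy setup the estimate should approach the mixing weight `p_a` as the number of samples grows, which mirrors the asymptotic behavior the abstract describes only in spirit.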

Brandon Colelough
May 14, 2026 at 3 p.m.

TBD

Leann Lindsey
May 19, 2026 at 11 a.m.

Are Genomic Language Models Learning? Insights from Tokenization Analysis and Prophage Detection in Bacterial Genomes

Genomic language models (gLMs) promise to decode the regulatory and functional logic encoded in DNA, yet whether current architectures learn meaningful biological representations remains contested. Recent studies question the foundational abilities of gLMs, demonstrating that they fail to outperform randomly initialized or simple supervised models on standard benchmarks, while model authors point to zero-shot performance and unsupervised motif discovery as evidence of foundational biological understanding. We present two complementary efforts to investigate this question. First, we systematically evaluate how tokenization strategy (nucleotide, k-mer, and byte-pair encoding) affects model behavior across three genomic benchmarks, probing whether token granularity shapes what gLMs capture at the nucleotide level. Second, we introduce LAMBDA, a genomic language model benchmark that leverages bacteriophages as a test system to investigate the annotation abilities of genomic language models. Unlike well-annotated model organism genomes, the vast majority of phage genomes remain poorly characterized, making them an ideal domain for testing whether gLMs identify meaningful sequence patterns beyond homology. LAMBDA evaluates gLM embeddings through phage-bacteria discrimination tasks of increasing complexity, including genome-wide prophage detection, and provides a rigorous framework for evaluating model performance on a genome-wide annotation task with direct relevance to microbiology and medicine.
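
To make the tokenization strategies named in this abstract concrete, here is a minimal sketch of nucleotide-level and k-mer tokenization of a DNA string. The function names and the `stride` parameter are illustrative assumptions, not taken from the talk or from any particular gLM. Byte-pair encoding is left out because it requires learning a merge vocabulary from a corpus.

```python
from typing import List


def nucleotide_tokens(seq: str) -> List[str]:
    """One token per base, e.g. 'ACG' -> ['A', 'C', 'G']."""
    return list(seq.upper())


def kmer_tokens(seq: str, k: int, stride: int = 1) -> List[str]:
    """Split a sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping k-mers.
    Trailing bases that do not fill a complete k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    s = seq.upper()
    return [s[i:i + k] for i in range(0, len(s) - k + 1, stride)]


if __name__ == "__main__":
    dna = "ACGTAC"
    print(nucleotide_tokens(dna))
    print(kmer_tokens(dna, k=3))            # overlapping 3-mers
    print(kmer_tokens(dna, k=3, stride=3))  # non-overlapping 3-mers
```

The same six-base sequence yields six tokens at nucleotide resolution but only two with non-overlapping 3-mers, which is the kind of granularity difference the abstract says may shape what a model captures.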