𝒥ohn Collins


Bioinformatics Python Software Engineering

Personalized Genomics & Biomolecular Diagnostics

NGS Pipelines | Data Science | Immuno-Oncology | Custom Full-Stack Web Apps

Written: November 2017
Last Updated: October 2022

CONTACT

805.647.5033

email me: jcollins.bioinformatics@gmail.com

Ventura, California 🌅






BIOINFORMATICS


First-hand data analysis on hundreds of next-generation high-throughput sequencing runs

Diverse Illumina® expertise (HiSeq™, MiSeq™, NextSeq™, NovaSeq™), spanning secondary through tertiary pipeline analysis for computational interpretation of raw NGS sequence data, including ultra-deep targeted panels, variant calling and genotyping, single-cell, unique molecular barcoding, and transcriptomics.



Advanced software development proficiency creating complex custom omics-based analytics using the Python data science stack

Daily use of SciPy, NumPy, pandas, Jupyter, Conda (package/virtual environment management), scikit-learn, and Matplotlib in a Linux/UNIX environment, with strong shell scripting, parallel programming, statistics, and machine learning (Theano, Caffe, Keras) skills. Moderate fluency in R and HTML5/CSS3, and working knowledge of C/C++, Java, and JavaScript.



Extensive academic and professional history of analysis incorporating myriad open source bioinformatics tools and databases

Mapping: BWA(-MEM), Samtools, Bedtools, Bowtie2, STAR; Annotation: UCSC Genome Browser, GENCODE, Ensembl; NCBI: BLAST, RefSeq, SRA, GEO, dbGaP/dbSNP/dbVar, ClinVar/OMIM; Cancer Genomics: TCGA, COSMIC; QC/Other: The Broad Institute (IGV/GATK/Picard/BD2K/LINCS), NIST’s GIAB (NA12878), Platinum Genomes; R-Bioconductor; Google Genomics API on GCP; 1000 Genomes Project, etc.



Rapidly improving command of state-of-the-art best practices for designing and deploying fast, scalable, high-performance pipelines, namely distributed task management for larger-than-memory datasets (e.g., multi-genomic) using efficient data structures and MapReduce-based cluster computing.





EXPERIENCE


Bioinformatics Data Scientist

at PACT Pharma, Inc.,

South San Francisco, California

Sept 2018—Sept 2020



Supported the bioinformatic data integration of an incredible interdisciplinary immuno-oncology effort spanning the latest technology in protein sciences, gene editing, and tumor immunology to create a first-of-its-kind patient-tailored cancer-targeting T cell therapy applicable to all solid tumor types.

  • Was the principal database/LIMS manager for all R&D teams, leveraging the cloud-based LIMS/ELN platform Benchling, including writing custom software against the Benchling API.
  • Created a custom Sanger sequencing web application (90% Python; >10,000 lines of source code) that reduced a multi-hour manual QC and analysis process for gene-edited plasmid sequencing data to <1 minute of fully automated computation. The application featured numerous UI/UX components requested by lab scientist end users, and was architected and deployed for 24/7 reliability and scalability, supporting multiple concurrent users on any number of machines/browsers, with parallelized computation allowing hundreds of samples to be processed per user session.


Bioinformatics Pipeline Developer

at Bristol-Myers Squibb,

Redwood City, California — Biologics Center

2018

Contributed new code and cloud computing infrastructure support to both: 1) supplement existing Python-backend + RShiny-frontend, wetlab user-facing, internal web apps for protein engineering sequence analysis, and 2) create from scratch a new NGS pipeline for "Antigen Receptor Repertoire" interpretation, as part of the Biologics Lead Discovery—Data Science & Bioinformatics team at the Immuno-Oncology R&D site.

  • Integrated a combination of open source bioinformatics and data science tools with custom Python programming, shell scripting, and Docker-containerized Linux environment management for cloud-based, high-performance computing of V(D)J-gene rearrangement genomics analyses.
  • Created an original implementation of scikit-learn's "Affinity Propagation" clustering algorithm using the python-Levenshtein edit-distance package from PyPI, for unsupervised classification of large lists of nucleic acid/protein sequences and identification of exemplar representatives without needing to specify the number of clusters in advance.
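
The edit-distance clustering idea in the last bullet can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code written at Bristol-Myers Squibb: the helper names `edit_distance` and `cluster_sequences` are hypothetical, the toy sequences are made up, and a plain dynamic-programming Levenshtein function stands in for the python-Levenshtein package so the sketch depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (stand-in for python-Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_sequences(seqs):
    """Cluster sequences on negative edit distance; return (labels, exemplars)."""
    # Similarity matrix: values closer to zero mean more similar sequences.
    sim = np.array([[-edit_distance(a, b) for b in seqs] for a in seqs], dtype=float)
    model = AffinityPropagation(affinity="precomputed", random_state=0).fit(sim)
    # Exemplars are actual input sequences chosen to represent each cluster.
    exemplars = [seqs[i] for i in model.cluster_centers_indices_]
    return model.labels_, exemplars


# Made-up toy sequences, purely illustrative.
toy = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "TTTTGGGG", "TTTTGGGA", "TTTAGGGG"]
labels, exemplars = cluster_sequences(toy)
```

Passing a precomputed similarity matrix is what lets a string metric be used here; the number of clusters is not an input and instead emerges from the similarities and the algorithm's preference setting.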


Bioinformatician II

—Sequencing Analysis

at Bio-Rad Laboratories,

Digital Biology Center

April/May 2017—July 2017

Created a working framework for a secondary and tertiary custom RNA-Seq pipeline for a single-cell, whole transcriptome, Unique Molecular Identifier-containing (UMI) assay called ddSEQ™ leveraging Bio-Rad’s Droplet Digital™ technology for individual cell isolation and 3’-end poly(A)-tail bead-capture along with Illumina’s “tagmentation” technology Nextera™ (adapter ligation).

  • Developed a cloud computing network architecture on Microsoft Azure, and genomically mapped samples in parallel using a scalable, distributed cluster of highly multi-core, ephemeral VM instances.
  • Wrote UMI-barcode deconvolution algorithms and integrated database genome-wide “gene” annotations to transform reads into digital transcript counts for analysis of differential gene expression.
  • Applied clustering, dimensionality reduction, decomposition, and feature enrichment methods from Python's scikit-learn library (such as K-means clustering, PCA, and t-SNE) to quantitatively and visually discriminate human and murine immune system cell subpopulations.
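
As a rough illustration of the UMI-to-count step mentioned in the bullets above, the sketch below collapses reads into digital transcript counts. It assumes reads have already been mapped and annotated into hypothetical `(cell barcode, UMI, gene)` tuples and uses simple exact-match collapsing; the actual ddSEQ™ pipeline logic is not described in this document and may well have differed.

```python
from collections import defaultdict


def umi_counts(reads):
    """Collapse reads into unique-UMI counts per (cell barcode, gene)."""
    seen = defaultdict(set)
    for cell, umi, gene in reads:
        # Reads sharing cell, gene, and UMI are treated as PCR duplicates.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Hypothetical, already-annotated reads: (cell barcode, UMI, gene).
reads = [
    ("CELL1", "AACG", "CD3E"),
    ("CELL1", "AACG", "CD3E"),  # duplicate of the read above
    ("CELL1", "TTGC", "CD3E"),
    ("CELL2", "AACG", "CD19"),
]
counts = umi_counts(reads)
```

The resulting per-cell, per-gene counts are the kind of matrix that downstream differential expression and clustering steps would consume.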


Bioinformatics Analyst

at CareDx, Inc.

(RA/Bioinformatician, 2014-2016)

November 2014—April 2017



Python-centric genotyping analysis, rigorous analytical validation, and publication-quality visualization of a deep coverage targeted SNP panel NGS diagnostic that measures “percent donor-derived cell-free DNA” (% dd-cfDNA) without prior genotyping of the donor or recipient, implementing a mix of standard and complex custom software tools with a focus on assay performance characterization.

  • Statistical Python scripting—pipeline vector-graphics reports automated to handle large sample input, generating various evolving assay-specific metrics critical for pass/fail classification such as coverage variability, error, bias, low DNA input (<3ng), heterozygous genotype %’s, and contamination.
  • Uncovering sample handling error by n² (pairwise) correlation of allele frequencies as a means of very high confidence identity verification (visualized by hover-interactive hierarchically clustered matrix-coefficient heatmaps to reveal correct genomic patient ID→analyte).
  • Determination of Limit of Quantification by nonlinear least squares modeling on the log-transformed CV's of analytical validation spike-in replicates (measuring 0.2-12% dd-cfDNA).
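
The pairwise identity check described in the bullets above can be sketched in a few lines. This is a minimal illustration with made-up allele fractions and hypothetical sample names, using a hand-written Pearson correlation; it is not the production CareDx code, and it compares each unordered pair once rather than filling the full n × n matrix.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(profiles):
    """Correlate every unordered pair of samples by their allele fractions."""
    names = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b]) for a, b in combinations(names, 2)}


# Made-up allele fractions at six SNP positions, purely illustrative.
profiles = {
    "patientA_draw1": [0.01, 0.50, 0.99, 0.49, 0.02, 0.98],
    "patientA_draw2": [0.02, 0.51, 0.98, 0.50, 0.01, 0.99],
    "patientB_draw1": [0.50, 0.99, 0.01, 0.02, 0.51, 0.49],
}
corr = pairwise_correlations(profiles)
best_pair = max(corr, key=corr.get)  # the two samples most likely from one person
```

Samples from the same individual share genotypes and so correlate strongly, while a sample whose best match carries a different patient label points to a possible handling error.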




PUBLICATIONS


Public examples of Python-built NGS analytics can be seen in the figures of the open-access publication below, which describes a novel liquid biopsy-like assay for quantitative surveillance of allograft “genome transplant dynamics.” It was published in the Association for Molecular Pathology’s official journal and selected as the cover feature of the issue presented during the 2016 annual meeting.


Association for Molecular Pathology—

The Journal of Molecular Diagnostics

November 2016, Cover Feature.


Grskovic, et al.

Validation of a Clinical-Grade Assay to Measure Donor-Derived Cell-Free DNA in Solid Organ Transplant Patients




Fig. 3A. AlloSure™’s validated Limit of Blank is precisely 0.1% — the 95th percentile of 180 “blank” sample results (in this case, meaning 2n genomic ploidy; i.e., in which the total relative fraction of plasma-circulating allogeneic cfDNA has an expected mean value of zero percent).
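
The percentile calculation behind a Limit of Blank like the one in the caption above can be sketched as follows. The nearest-rank percentile method and the function name `percentile_nearest_rank` are assumptions made for illustration; the publication may use a different percentile estimator, and no real assay data appears here.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding of pct/100 * n.
    rank = -(-pct * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]


# Placeholder "blank" results: 180 dummy values standing in for real measurements.
blank_results = list(range(1, 181))
limit_of_blank = percentile_nearest_rank(blank_results, 95)
```

With 180 blank results, the nearest-rank 95th percentile is the 171st smallest value, so roughly 5% of true blanks would be expected to read above the limit.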




EDUCATION


University of California

Santa Cruz

2013—2014

Master of Science (M.Sc.),

Biomolecular Engineering

and Bioinformatics


2009—2013

Bachelor of Science (B.S.),

Bioengineering

Graduated with Honors in the Major