Data Scientist + Engineer

Pacific Northwest

About Me

I’m a full-stack data professional passionate about helping companies make their data useful with machine learning and data engineering.

When I’m not thinking about data, I like to backcountry ski, trail run, and take care of my plants.

Interests

  • Building orchestration pipelines
  • Implementing ML + AI models described in literature
  • Designing visualization tools to tell stories about data

Education

  • PhD in Biomedical Engineering, 2021

    University of Washington

  • BSc in Molecular and Cellular Biology, 2012

    University of California, Los Angeles

What Am I Doing Now?


Senior Data Scientist

Just-Evotec Biologics

Mar 2022 – Present Seattle, WA
  • Designed and deployed ELT pipelines supporting data lakehouse architectures for multi-scale and cross-functional biomanufacturing data. Enhanced data integrity, availability and traceability in accordance with best-practice data governance policies.
  • Owned the design and implementation of cloud-native infrastructure supporting critical security capabilities, application scaling and availability, and real-time event-driven data synchronization for business processes. Set up and managed multiple cloud environments with IaC tools. Expedited application development cycle and increased deployment robustness.
  • Accelerated biotherapeutic discovery by leading the development of LLM-based AI models for targeted antibody design, antibody humanization, and codon optimization. Owned E2E ML development lifecycle: curating protein datasets, developing processing pipelines, deploying models as horizontally scalable services behind APIs for inference, tracking model experiments.
  • Developed a suite of APIs, visualization tools, and analysis applications using a microservice framework to standardize data analyses, increase data access, and automate report generation. Increased project throughput and reduced costs.

Past Roles


Data Scientist

CuriBio

Jun 2021 – Apr 2022 Seattle, WA
  • Developed and deployed image processing AI models for predicting the likelihood of cell differentiation success from high-throughput microscopy imaging datasets. Reduced material resource costs by upwards of 25% for the associated research stage.
  • Developed client-side waveform analysis software for characterizing contractility profiles of engineered cardiac and skeletal muscle stem cells. Leveraged signal processing algorithms to analyze impact of therapeutics on muscle cell function.

PhD Graduate Student, University of Washington

Integrated Brain Imaging Center

Sep 2014 – Sep 2021 Seattle, WA
  • Developed graph neural network approaches to segment cortices of medical brain images. Trained models yielded improvements in classification accuracy of 8% over conventional image alignment algorithms. Improved test-retest reliability of patient-specific segmentations by 6% across clinical scanning sessions.
  • Applied a novel modal decomposition algorithm (DMD) for studying fMRI brain dynamics that outperformed the state of the art (ICA) at identifying canonical activation networks. Increased test-retest reliability of detected networks by 7% while requiring shorter-duration MRI scanning sessions than the state of the art.
  • Designed novel approach for analyzing variability in the topography of functional brain connectivity using spatial statistical modeling. Results aligned with long-standing theories of hierarchical brain organization.
  • Developed turn-key orchestration pipeline for processing 1000+ functional and diffusion MRI scans (>1.5TB), deployed on GPU-backed high-performance computing system.
  • Awarded highly selective 3-year fellowship from the ARCS Washington Research Foundation to pursue doctorate research.

Software Engineering Intern

Phase Genomics

Apr 2017 – Jun 2017 Seattle, WA

Data Science Intern

Pacific Northwest National Laboratory

Jun 2016 – Sep 2016 Richland, WA

Skills and Technologies

Data Science

APIs (FastAPI), Python data stack (numpy, scipy, pandas, etc.), visualization (Dash, Plotly, Streamlit)

Cloud infrastructure

AWS (EC2, ECS, EKS, EventBridge, IAM, Lambda, RDS, S3, SageMaker, VPC), Terraform

Data Engineering

AI + ML (PyTorch, DGL), MLOps (MLflow), deployment (ECS, SageMaker, Lambda), CI/CD (GitLab), containerization (Docker), orchestration (Dagster), databases (SQL), distributed processing (PySpark)

Software Engineering

object-oriented design, test-driven development, data structures + algorithms

Recent Posts

CI/CD Part 4: Container Registries

This is the last post in a mini-series on designing GitLab CI/CD pipelines. We’ve discussed the basic anatomy of a .gitlab-ci.yml file, how to set up authentication tokens and files for building and pushing packages to a registry, and how to design a Dockerfile for building images from a package in the context of a CI/CD pipeline.

CI/CD Part 3: Building containers with Docker

This is the third post in a mini-series on designing GitLab CI/CD pipelines. In the last post, we discussed setting up your .pypirc and .netrc files in the context of a GitLab CI/CD pipeline to enable building and pushing packages to a package registry, as well as for installing code from a private registry.
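As a rough sketch of the kind of configuration that post walks through, a minimal .pypirc and .netrc pair might look like the following (the hostname, project ID, and token below are illustrative placeholders, not values from the post):

```ini
; ~/.pypirc — names a package index for twine to push to
; (hostname and project ID here are hypothetical)
[distutils]
index-servers = gitlab

[gitlab]
repository = https://gitlab.example.com/api/v4/projects/12345/packages/pypi
username = gitlab-ci-token
; in CI, the token is typically injected via TWINE_PASSWORD
; rather than stored in this file

; ~/.netrc — credentials pip uses when installing from a private index
; machine gitlab.example.com
; login gitlab-ci-token
; password <access-token>
```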

CI/CD Part 2: Building and pushing packages

This is the second post in a mini-series on designing GitLab CI/CD pipelines. In order to build packages and push them to a remote package registry, we use the build and twine packages. build generates a package, and twine pushes this package to a registry (or “index”).
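As a sketch of how those two tools might fit into a pipeline job (the job name, stage, image tag, and repository alias here are hypothetical, not taken from the post):

```yaml
# Hypothetical GitLab CI job: build the package, then push it to a registry.
publish:
  stage: deploy
  image: python:3.11
  script:
    - pip install build twine
    - python -m build                     # writes sdist + wheel into dist/
    - python -m twine upload --repository gitlab dist/*  # alias from .pypirc
```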

CI/CD Part 1: GitLab Pipelines

I recently developed a template workflow to help our team adopt a CI/CD-based development strategy. Many of our web applications and tools were based on simple repository structures. With growing datasets and ever-increasing use by outside teams, we found ourselves needing to add new features more frequently to many of these tools and believed that continuous integration and deployment could help us not just develop more quickly, but also more intelligently.

Visualizing SQL Schemas

I was recently tasked with examining databases related to some computer vision tools that my company had acquired. Basically, the framework was as follows… Clients/users would sign up for some service with the goal in mind of building a model to classify a set of microscopy images.

Software

parcellearning

package of neural network modules for learning cortical architectures from brain connectivity data

submet

package to compute various distance metrics between subspaces

ddCRP

package to fit distance-dependent Chinese Restaurant Process models

fieldmodel

package to fit distributions over scalar fields on the domain of regular meshes

Talks and Presentations

Automated Connectivity-Based Parcellation With Registration-Constrained Classification

(Best Talk, Honorable Mentions)

Analyzing the Resting Brain with Dynamic Mode Decomposition

Posters and Publications

Learning Cortical Parcellations Using Graph Neural Networks

We examine the utility of graph neural networks for learning cortical segmentations. We show that attention-based transformer networks significantly outperform conventional GCN and linear feed-forward variants for generating accurate, reproducible cortical maps.

Linear Mapping of Cortico-Cortico Resting-State Functional Connectivity

Using non-linear dimensionality reduction of functional brain connectivity patterns and multivariate spatial statistics to characterize the functional embeddings, we analyze the spatial relationships between pairs of cortical regions to better examine how they connect and relate to one another.

Extracting Reproducible Time-Resolved Resting State Networks Using Dynamic Mode Decomposition

In this paper, we develop a novel method based on dynamic mode decomposition (DMD) to extract resting-state networks from short windows of noisy, high-dimensional fMRI data, allowing RSNs from single scans to be resolved robustly at a temporal resolution of seconds. This automated DMD-based method is a powerful tool to characterize spatial and temporal structures of RSNs in individual subjects.

Automated Connectivity-Based Cortical Mapping Using Registration-Constrained Classification

In this analysis, we propose the use of a library of training brains to build a statistical model of the parcellated cortical surface to act as templates for mapping new MRI data.

Registering Cortical Surfaces Based on Whole-Brain Structural Connectivity and Continuous Connectivity Analysis

We present a framework for registering cortical surfaces based on tractography-informed structural connectivity. We define connectivity as a continuous kernel on the product space of the cortex, and develop a method for estimating this kernel from tractography fiber models.