Andrew Vincent O'Connor
Senior Site Reliability Engineer | Kubernetes, Multi-Cloud & ML Infrastructure
andrewoconnor@outlook.com
301-624-9886
Summary
Senior Site Reliability Engineer with 10+ years in software and infrastructure engineering, including recent experience building and operating production ML infrastructure in regulated healthcare environments. Strong background in Kubernetes, Amazon EKS, multi-cloud AWS/GCP infrastructure, Terraform, CI/CD automation, LLM inference deployment, observability, and GPU-based training systems. Known for owning projects end-to-end, improving reliability and cost efficiency, and partnering closely with AI and engineering teams to operationalize machine learning systems in production.
Experience
Senior Site Reliability Engineer
March 2024 - Present | Imagen Technologies

• Own deployment, reliability, performance, and cloud infrastructure for FDA-approved, AI-powered medical devices in a HIPAA-regulated environment, partnering with AI and software teams to operationalize ML systems in production.

• Provisioned large-scale AI training infrastructure on GCP, including a 256-GPU NVIDIA H100 Slurm cluster with high-performance shared storage, enabling distributed training and removing checkpointing bottlenecks.

• Productionized LLM inference on AWS by packaging fine-tuned models as Docker/vLLM images, publishing 100GB+ containers to Amazon ECR via GitHub Actions, and deploying to SageMaker.

• Led a Lambda-based inference initiative that improved scalability and reduced cost by 23% versus EC2.

• Built a production-grade GCP Dataflow pipeline to de-identify and catalog 1.67PB of medical images in one month, reducing run costs by ~75% and enabling compliant training data preparation.

• Applied AWS Bedrock Data Automation to extract text from image-based medical reports, improving access to unstructured clinical data for downstream AI and analytics workflows.

• Strengthened multi-cloud security for ML workloads by implementing workload identity federation for EKS workloads accessing GCP resources without long-lived credentials.

Site Reliability Engineer
August 2022 - March 2024 | Imagen Technologies

• Built centralized observability and alerting for critical production systems using CloudWatch dashboards, Okta-secured access, and PagerDuty integrations, improving operational visibility, incident response, and service reliability.

• Created reusable Terraform modules adopted across 20+ AWS accounts to standardize CloudWatch dashboards, GitHub Actions OIDC federation in IAM, AWS Lambda infrastructure, and AWS Verified Access endpoints.

• Led rollout of OIDC-based AWS authentication across CI/CD pipelines, eliminating static credentials and improving security for infrastructure delivery.

• Automated AWS operational workflows with Step Functions, Lambda, and Systems Manager to orchestrate tasks such as EC2 patching, improving security, standardization, and reliability in production environments.

• Extended Terraform-based CI/CD delivery across AWS and GCP, improving consistency and automation for multi-cloud infrastructure.

• Modernized the EKS-based infrastructure delivery platform, reducing cost by ~83% while improving maintainability and supportability for containerized workloads.

• Built immutable image pipelines with AWS CodeBuild and EC2 Image Builder to support reliable application releases.

Technical Consultant
February 2021 - August 2022 | Philips

• Installed and configured the PerformanceBridge radiology analytics platform and supporting systems across Linux and Windows Server environments for healthcare customers in multiple regions.

• Resolved complex customer concerns and technical issues through deep investigation, research, and reproduction.

• Automated recurring configuration tasks, created procedural documentation, and provided training to colleagues, improving operational consistency and onboarding.

• Supported product expansion into EMEA and APAC through client implementations and technical configuration for organizations including:

  • medneo GmbH
  • King Faisal Specialist Hospital, Saudi Arabia
  • Chiba University Hospital
Software Engineer
March 2020 - February 2021 | Azenta / Brooks Life Sciences

• Built high-performance cloud-based applications for biobanking and laboratory automation.

• Won employee Key Strategy award for contributions to COVID-19 projects.

• Streamlined COVID-19 testing and reporting workflows, increasing efficiency and reducing errors for global biobanking processes.

• Designed a sample lineage interface for the UK NHS COVID-19 system, enabling the processing of hundreds of thousands of tests per week.

• Developed a sample normalization workflow for Curative Korva Labs, which conducted 20% of all tests in California, and integrated it with state and local public health departments to report results.

Software Engineer
December 2017 - February 2020 | RURO Inc

• Built Rails-based software for clients including Roche and the National Institutes of Health.

• Reduced sample turnaround times for multiple clients from weeks to hours.

• Designed integrations with hardware devices and external systems including:

  • Scientific instruments (ASTM E1394)
  • Billing interfaces (HL7)
  • Payment systems (Stripe)
  • EHR/EMR systems (Epic)
  • Insurance preauthorizations
Senior Associate
June 2016 - December 2017 | Avalere Health

• Designed and built analytics tools for the post-acute care industry.

• Built web scrapers to collect plan data from state health insurance exchanges.

• Rebuilt a Tableau-based product using open-source software and on-demand cloud solutions, lowering costs by 50%.

Junior Software Engineer
June 2015 - May 2016 | RURO Inc

• Designed and developed a medical sample inventory application for an Android RFID reader.

• Integrated with hardware devices including:

  • Barcode readers
  • Signature pads
  • Robotic freezers
Education
Towson University
January 2010 - December 2012 | Towson, MD
Bachelor of Science in Computer Science
Skills
Kubernetes / Containers / ML Infrastructure
Kubernetes, Amazon EKS, Docker, Amazon ECR, Amazon SageMaker, vLLM, Slurm, containerized ML workloads, model inference deployment, production ML systems, GPU training infrastructure
Cloud / Infrastructure
AWS (EC2, ECS, EKS, Lambda, SageMaker, Step Functions, S3, ECR, CloudWatch, IAM, RDS, SNS, SQS, VPC, Systems Manager), GCP, multi-cloud infrastructure, workload identity federation
Infrastructure as Code / CI/CD
Terraform, CloudFormation, Ansible, GitHub Actions, CodeBuild, EC2 Image Builder, CircleCI, OIDC federation, immutable infrastructure, ML CI/CD
Reliability / Operations
CloudWatch dashboards, monitoring, alerting, PagerDuty, incident response, reliability engineering, operational automation, Okta
Backend / Data
Python, Ruby, PostgreSQL, Nginx, GCP Dataflow
Healthcare / Compliance
HIPAA-compliant systems, HL7, Epic, radiology workflows, laboratory workflows, medical imaging data, clinical document processing, insurance preauthorizations
Certifications
AWS Certified Solutions Architect - Associate
Projects
drumroll.world website
drumroll.world
Explore drumming history visually
Languages
English
Native
Spanish
Native