Scale AI

Machine Learning Engineer - Model Evaluations, Public Sector

Scale AI | San Francisco / St. Louis / New York / Washington DC, United States | Hybrid
$187k - $300k USD (Verified)

Job Description

Public Sector ML engineers at Scale AI design and scale automated evaluation pipelines for LLMs, agentic systems, and multimodal models deployed in mission‑critical government environments. The role focuses on building evaluation frameworks, stress tests, and red‑teaming workflows to ensure safety, robustness, and reliability of advanced AI systems used by defense, intelligence, and federal customers.

Responsibilities

  • Develop and maintain automated evaluation pipelines for ML models across performance, robustness, safety, and functional metrics, including LLM‑as‑a‑judge evaluations
  • Design test datasets and benchmarks to measure generalization, bias, explainability, and failure modes
  • Build evaluation frameworks for LLM agents, including scenario‑ and environment‑based testing infrastructure
  • Conduct comparative analyses of model architectures, training procedures, and evaluation outcomes
  • Implement tools for continuous monitoring, regression testing, and quality assurance of ML systems
  • Design and run stress tests and red‑teaming workflows to uncover edge cases and vulnerabilities
  • Collaborate with operations teams and subject‑matter experts to produce high‑quality evaluation datasets

Benefits

  • Base salary range: $208,000–$300,000 (SF/NY/Seattle) and $187,000–$270,000 (DC/TX/CO), plus equity
  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • Learning and development stipend
  • Generous PTO
  • Potential commuter stipend

Category

Machine Learning Engineer
