Machine Learning Engineer - Model Evaluations, Public Sector

Scale AI | San Francisco / St. Louis / New York / Washington DC, United States | Hybrid

$187k - $300k USD (Verified)

Job Description

Public Sector ML engineers at Scale AI design and scale automated evaluation pipelines for LLMs, agentic systems, and multimodal models deployed in mission‑critical government environments. The role focuses on building evaluation frameworks, stress tests, and red‑teaming workflows to ensure safety, robustness, and reliability of advanced AI systems used by defense, intelligence, and federal customers.

Responsibilities

Develop and maintain automated evaluation pipelines for ML models across performance, robustness, safety, and functional metrics, including LLM‑judge‑based evaluations
Design test datasets and benchmarks to measure generalization, bias, explainability, and failure modes
Build evaluation frameworks for LLM agents, including scenario‑ and environment‑based testing infrastructure
Conduct comparative analyses of model architectures, training procedures, and evaluation outcomes
Implement tools for continuous monitoring, regression testing, and quality assurance of ML systems
Design and run stress tests and red‑teaming workflows to uncover edge cases and vulnerabilities
Collaborate with operations teams and subject‑matter experts to produce high‑quality evaluation datasets

Benefits

Base salary range: $208,000–$300,000 (SF/NY/Seattle) and $187,000–$270,000 (DC/TX/CO), plus equity
Comprehensive health, dental, and vision coverage
Retirement benefits
Learning and development stipend
Generous PTO
Potential commuter stipend

Ready to Apply?

Applications go directly to Scale AI's career portal