R&D
Senior SRE & Linux Infrastructure Engineer - ML Platform
Mobileye’s ML Platform group builds and operates the core infrastructure that powers large-scale AI workloads. We manage a massive, high-performance environment consisting of both multi-cloud clusters and on-prem bare-metal nodes optimized with AI accelerators. We are looking for a highly experienced Senior SRE / Linux Systems Engineer who thrives on managing complex, low-level infrastructure. This isn't just a cloud-configuration role: you will be responsible for the health and performance of expensive, high-density hardware. You must be an expert at troubleshooting open-source systems and "living" inside Linux environments to ensure our AI clusters run at peak efficiency.
What will your job look like?
- Build and maintain infrastructure for large‑scale AI and HPC workloads across on‑prem and cloud environments
- Operate and enhance our multi‑cloud, multi‑cluster scheduling platform
- Troubleshoot complex issues across the stack: from kernel-level tuning and drivers to networking, storage, and distributed system bottlenecks
- Ensure the reliability of critical platform services: queuing systems, time-series databases, and logging pipelines
- Develop deeply integrated automation and tooling
- Collaborate with ML engineers and IT engineers to optimize hardware utilization for data-intensive workloads
- Drive best practices in system design, observability, and infrastructure-as-code
All you need is:
- 10+ years of hands‑on experience in SRE, Linux Administration, or Systems Engineering
- Expert-level Linux knowledge: Deep understanding of system internals, debugging, performance tuning, and the ability to solve failures where hardware meets software
- Kubernetes Expertise: Proven experience managing K8s at scale (both managed EKS and bare-metal deployments)
- Distributed Systems Mastery: Hands-on experience debugging and maintaining:
  - Queuing Systems: RabbitMQ or similar
  - Metrics/Observability Stacks: Prometheus, Thanos, and Grafana, or similar
  - Logging: Elasticsearch or similar
  - Relational Databases: PostgreSQL or similar
- Infrastructure-as-Code: Proficiency with Terraform, Helm, and configuration management
- Networking & Scripting: Strong fundamentals in networking and proficiency in Bash
- Familiarity with GPU/Accelerator scheduling, AI/ML pipelines
- Experience with multi-cloud architectures and hybrid environments
- Experience with workflow orchestration tools (e.g., Argo Workflows)
What We Offer:
- Impact: Support the engineering that advances Mobileye’s AI and global transportation safety
- Cutting-Edge Hardware: Work with high-value, AI-optimized bare-metal clusters at massive scale
- Technical Depth: A highly technical environment focused on solving deep systems engineering challenges
- Collaboration: Work alongside elite ML, software, and systems engineers
Mobileye changes the way we drive, from preventing accidents to semi- and fully autonomous vehicles. If you are an excellent, bright, hands-on person with a passion for making a difference, come lead the revolution!


