Site Reliability Engineer Job at Berkley Hunt, San Jose, CA

  • Berkley Hunt
  • San Jose, CA

Job Description

Senior Site Reliability Engineer (GPU Compute) | Hybrid – Bay Area, CA

Berkley Hunt is supporting a fast-growing AI startup building a high-performance, cloud-native platform to power cutting-edge machine learning workloads. As they scale, they’re hiring a Senior/Staff Infrastructure Engineer to lead the development of a scalable GPU compute environment from the ground up.

About the Role:

This is a high-impact role for an experienced infrastructure engineer who thrives in fast-paced environments and wants to shape the future of AI infrastructure. You’ll design, build, and operate the systems that enable high-throughput GPU workloads at scale—collaborating closely with the core engineering team to optimize performance, efficiency, and reliability.

If you're excited about solving deep technical challenges in distributed compute and cloud automation, this could be a standout opportunity.

Responsibilities:

  • Build and maintain a large-scale, distributed GPU compute platform powering AI workloads.
  • Develop backend systems in Python to orchestrate GPU jobs and manage routing, observability, and capacity.
  • Design and implement infrastructure with tools like Terraform, Ansible, and Kubernetes across cloud and bare metal environments.
  • Own the reliability, scalability, and performance of the platform, from provisioning to deployment and monitoring.
  • Collaborate with the engineering team to shape infrastructure vision and technical strategy over the next 1–5 years.
  • Drive automation and improvements to minimize operational overhead and scale efficiently.

Requirements:

  • 6+ years of experience in cloud infrastructure or backend engineering roles.
  • Deep knowledge of distributed compute systems, especially involving GPU orchestration.
  • Proficiency with Python and infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Solid experience with Kubernetes and CI/CD pipelines.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure); bare metal experience is a plus.
  • Excellent problem-solving skills and a proactive, ownership-driven mindset.

Nice to Have:

  • Experience at a high-growth startup or in scaling large infrastructure systems.
  • Familiarity with GPU resource scheduling and performance optimization.
  • Hands-on experience with observability stacks (Prometheus, Grafana, Loki, Thanos).
  • A passion for automation, infrastructure design, and moving fast without breaking things.
