Site Reliability Engineer Job at Berkley Hunt, San Jose, CA

  • Berkley Hunt
  • San Jose, CA

Job Description

Senior Site Reliability Engineer (GPU Compute) | Hybrid – Bay Area, CA

Berkley Hunt is supporting a fast-growing AI startup building a high-performance, cloud-native platform to power cutting-edge machine learning workloads. As they scale, they’re hiring a Senior/Staff Infrastructure Engineer to lead the development of a scalable GPU compute environment from the ground up.

About the Role:

This is a high-impact role for an experienced infrastructure engineer who thrives in fast-paced environments and wants to shape the future of AI infrastructure. You’ll design, build, and operate the systems that enable high-throughput GPU workloads at scale, collaborating closely with the core engineering team to optimize performance, efficiency, and reliability.

If you're excited about solving deep technical challenges in distributed compute and cloud automation, this could be a standout opportunity.

Responsibilities:

  • Build and maintain a large-scale, distributed GPU compute platform powering AI workloads.
  • Develop backend systems in Python to orchestrate GPU jobs and manage routing, observability, and capacity.
  • Design and implement infrastructure with tools like Terraform, Ansible, and Kubernetes across cloud and bare metal environments.
  • Own the reliability, scalability, and performance of the platform, from provisioning to deployment and monitoring.
  • Collaborate with the engineering team to shape infrastructure vision and technical strategy over the next 1–5 years.
  • Drive automation and improvements to minimize operational overhead and scale efficiently.

Requirements:

  • 6+ years of experience in cloud infrastructure or backend engineering roles.
  • Deep knowledge of distributed compute systems, especially involving GPU orchestration.
  • Proficiency with Python and infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Solid experience with Kubernetes and CI/CD pipelines.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure); bare metal experience is a plus.
  • Excellent problem-solving skills and a proactive, ownership-driven mindset.

Nice to Have:

  • Experience at a high-growth startup or in scaling large infrastructure systems.
  • Familiarity with GPU resource scheduling and performance optimization.
  • Hands-on experience with observability stacks (Prometheus, Grafana, Loki, Thanos).
  • A passion for automation, infrastructure design, and moving fast without breaking things.
