Sr. Site Reliability Engineer - Incident Management Job at Walmart, Bentonville, AR

  • Walmart
  • Bentonville, AR

Job Description

About the Team: Building the right technology foundation for infrastructure and platforms is vital to success at the scale of Walmart. Our team builds and maintains the foundational technologies that support the tech organization. Included in this are data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. All of these products and services are supported by scalable and powerful infrastructure, ensuring a secure and seamless employee and customer experience across stores, digital channels, and distribution centers.

About the Role: As a Technical Duty Officer (TDO) within the Global Technology (GT) CCC team, you will work with other CCC, SRE, DevOps and Engineering practitioners to pro-actively maintain the mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that ensure the highest levels of availability and reliability for Walmart’s technology stack.

You're right for the job if you are comfortable leading major incident response with a technical team of engineers laser-focused on restoring service across complex distributed systems. To do this, you will draw on your knowledge of the tech stack and tools to surface key data. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE, Engineering and DevOps teams to support our next-generation “always up” cloud-based technology platforms.

The TDO takes command and control of Major Incidents, focusing on restoration by identifying and coordinating with the appropriate resources through all phases of triage, restoration and validation. You will understand the technology stack and use this knowledge to ensure systems continue to meet production-ready standards; Operational Excellence is key. Good judgment is crucial, as you will own and deploy critical switches for command/control mitigation tooling. Your ability to continuously challenge yourself and build a strong cross-functional network of peers and stakeholders will see you excel in this role.

Our goal is to protect the customer, merchant and associate experience and deliver outstanding levels of availability across Walmart Global Technology. To do so, you will need strong skills in the following areas:

  • Control incident management processes and procedures.
  • Calm under pressure when controlling major incident response.
  • Excellent end to end technical understanding of core infrastructure, cloud services, platforms and micro-services.
  • Ability to understand traffic flows and key dependencies between services.
  • Ability to understand and capture key data from various sources, systems and people.
  • Ability to effectively triage – be able to detect and determine symptom vs cause.
  • Act as a technical leader and coach within the CCC.
  • Analyze trends to pro-actively prevent incidents.
  • Focus on leading immediate restoration vs root cause.
  • Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
  • Development of monitoring and alerting frameworks.
  • Absorb knowledge and understand complex distributed systems - ability to share and impart this knowledge into your peer group and beyond.
  • Strong focus on collecting and inferring metrics.
  • Influence the design of system architecture and tactical solutions.
  • Provide data for and actively participate in root cause analysis partnering with the Problem Management function.
  • Design and implement technical solutions and process improvements to improve detection and resolution response that prevent repeat issues.
  • Understanding and consideration of business strategies and priorities.
  • Coach and help develop CCC engineers.
  • Familiar with log-centric tooling, ideally Splunk. Produce time series data and reusable dashboards for use both during and post event.
  • Define the standards and requirements for new service onboarding, ensuring they are fit for production.
  • Drive standardization and service-focused instrumentation.

Additional responsibilities may include:

  • Share knowledge globally between CCC teams.
  • Strive for continuous improvement and make recommendations based on CCC process.
  • Other duties and responsibilities as assigned.

Qualifications:

  • 7+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.
  • Bachelor's Degree in Computer Science or a related field, or relevant work experience.
  • Strong and demonstrable incident management skills with relevant experience in an enterprise organization.
  • Experience and exposure working in a 24/7 operations support environment.
  • Methodical and systematic problem-solving approach, combined with a solid awareness of ownership, initiative and drive.
  • Experience investigating, analyzing and troubleshooting large scale enterprise systems.
  • Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Experience administering Unix/Linux in a production environment.
  • Experience working with and developing enterprise monitoring/tooling/logging solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic, Dynatrace and Prometheus.
  • Working knowledge of one or more cloud technologies such as Azure, GCP, OpenStack.
  • Experience with distributed version control like Git or similar.
  • Familiarity with continuous integration/deployment processes and tools such as Jenkins, Maven, Nexus, etc.
  • Programming experience in one or more of the following languages: Go, Java, Python, Shell, etc.
  • Experience in data science/machine learning would be advantageous.

Job Tags

Work experience placement, Immediate start
