Walmart ecommerce sites handle thousands of visitors and millions of transactions per day, and when something stops working, hundreds of thousands of dollars are lost. Imagine the activity, the transactions, and the data that flow through one of the largest ecommerce sites in the world. Do you have that in your mind? Now imagine what it might take to keep that site up, performing, and running efficiently. If you have a clear picture of that in your head, you might be the person we need for our SRE team! Together, our SREs manage a large-scale system made up of thousands of physical servers, request rates in the hundreds of thousands per second, and data measured in petabytes. The SRE team responds to production issues on a 24/7 basis, and you will bring energy and a relentless focus on continuous improvement within a fast-paced environment. If you can still picture all of this in your head, we should talk! The job: Do what's necessary to maintain our high standard of customer experience.
Your Opportunity
As a Site Reliability Engineer on the marketplace team at Walmart, you'll have the opportunity to:
Enjoy working on challenges that no one has solved yet
Influence Engineering teams to design applications that are cloud-ready
Define monitoring needs to ensure the best customer experience
Partner with other Engineering teams to have the right tool set to deliver the best customer experience on the Walmart eCommerce site
Position Responsibilities:
Assist in providing guidance to small groups of two to three engineers, including other global team associates, for assigned Engineering projects
Build and maintain Walmart's next-generation infrastructure platform
Administration of production infrastructure
Drive improvements in all aspects of service delivery, including change management, continuous delivery, security, monitoring and reliability
Database administration in a mission-critical, 24/7 environment which includes e-commerce, accounting, warehouse management and decision support systems
Own end-to-end availability and performance of mission critical services and build automation to prevent problem recurrence; automate response to all non-exceptional service conditions
Own the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
Design, implement, and support high-performance, highly-available services and infrastructure
Proactively identify tooling and automation opportunities
Build automation and internal tooling to help with various checks and communication
Maintain good communication with other stakeholders and partners
Improve the efficiency and flexibility of our datacentres
Build and maintain models for growth and capacity planning
Deployment, support and monitoring of new platforms and application stacks
Participate in new technology evaluation, design and development of highly scalable distributed databases
Explore and evaluate new technologies and solutions to push our capabilities forward, getting ahead of our customers' needs and getting people incentivized to transform, innovate and continually improve
Position Requirements:
Minimum qualifications:
BS/MS in Computer Science or related field with 8-12 years of experience supporting infrastructure in a high-volume, customer-facing environment
Technically strong, with the ability to lead a team of 2-3 members
Take ownership and deliver at scale
Capability to program in at least one language, ideally Python or Perl, but Ruby, C/C++, Java, or others are okay
Experience with Unix/Linux systems with scripting experience in Shell, Perl or Python
Strong knowledge of core protocols and tech such as: TCP/IP, HTTP, DNS, load balancers, distributed file systems, key-value and relational databases
Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
Experience with specific software such as Hadoop, Kafka, Spark, Cassandra, and similar technologies is desirable, but the ability to quickly learn new technology is most important
Capable of technical deep-dives into code, networking, systems, and storage with very bright, experienced engineers
Expertise in problem solving and analysing global-scale distributed systems
Logging and monitoring experience: designing, deploying and running systems such as Splunk, ELK, New Relic or other APM solutions
Work with product delivery teams to identify architectural issues and ensure timely and smooth delivery of features into operations.
Identify gaps in processes, skills, tooling, technology choices and work with upper management to drive improvements within the organization.
Excellent written and verbal communication skills in order to influence architectural and process level change in the organization.
BA/BS degree in Computer Science or related technical field, or equivalent practical experience.
Additional Qualifications:
Able to review architecture and suggest reliability improvements
Guide and provide consultation on best practices for CI/CD
Come up with automated approaches to avoid manual KTLO (keep the lights on) work and repeated tasks
Think innovatively and propose solutions for reliability on an ever-growing infrastructure footprint
Willing to learn and use internal and enterprise tooling at Walmart
Keyskills: Unix, Automation, Linux, Networking, Configuration management, Database administration, DNS, Perl, Customer experience, Python