An SRE engineer collaborates with Development, Test, and IT Operations to create and deploy scalable and reliable software systems, for both on-prem and cloud-based deployments (preferably Azure), with a strong grasp of microservices deployments.
The SRE engineer is responsible for operating applications in mission-critical production systems and taking the necessary actions to keep the site up and running.
Key focus areas of SRE:
Automation
Monitoring (application monitoring and log monitoring), e.g. ELK, EFK, OpenSearch, PLG.
Tracing expertise in managed Kubernetes (k8s) clusters (preferably Azure).
Work blamelessly, always assuming the best intentions, and finding systemic causes together.
Create an SRE culture that reinforces our SRE principles. Enhance and maintain uptime.
Celebrate failure as an investment in reliability. Learn from each one with incident retrospectives.
Treat reliability as a feature. Put reliability goals in specifications right from the start.
Share information within the teams and organization, and work in collaboration with other teams.
Create on-call schedules that are empathetic and fair.
Monitor infrastructure using SRE tools and suggest tools as necessary.
Build monitoring alerts and incident response processes, using the monitoring systems (for alerting and dashboards).
Improve operational processes and team practices.
Code infrastructure automation across the CI/CD pipeline.
As the solution scales, ensure reliability through designing, building, and maintaining the core infrastructure.
Demonstrate strong programming skills and thorough knowledge of systems.
Bring about cultural shifts to provide a foundation for process changes.
After incidents, document the actions taken in runbooks so they can be turned into automated solutions for future incident response.
Participate in the on-call rotation for incident response and proactive incident measures.
Administer production jobs and understand debugging info.
Drain traffic away from a cluster, block or rate-limit unwanted traffic, and bring up additional serving capacity.
Roll back a bad software push, with minimal downtime.
Describe the architecture, components, and dependencies of the services to other teams.
Provide visibility into the performance of the application and reduce the cost of failure to lower new feature cycle time.
Key Skills of an SRE engineer:
Familiarity with at least two coding/scripting languages (Python, Go, Java, .NET, C, C++, PowerShell, etc.).
Cloud competency (Microsoft Azure is preferred).
Deep understanding of key Azure services such as Azure Kubernetes Service (AKS), Databricks, Data Factory, API Management, Function Apps, and Application Gateway.
CI/CD process and tools (Jenkins, GitHub Actions, Azure DevOps, etc.).
Experience with service mesh and message broker services (Kafka).
Deep understanding of containerization tools: Docker, Helm, Kubernetes.
Experience in infrastructure monitoring (Datadog, Prometheus, Grafana, etc.).
Experience in log and performance monitoring (Splunk, ELK, New Relic, etc.).
Deep understanding of databases (SQL, NoSQL, Postgres Flexible Server, etc.).
Job Classification
Industry: Industrial Equipment / Machinery
Functional Area: Industrial Equipment / Machinery
Role Category: Software Development
Role: Software Development - Other
Employment Type: Full time