We are seeking a highly skilled Site Reliability Engineer (SRE) with strong experience in Kubernetes troubleshooting and incident response, deep knowledge of monitoring and alerting systems, and solid experience designing and maintaining CI/CD pipelines. You will play a key role in building and maintaining reliable infrastructure, enhancing observability, and ensuring uptime for mission-critical systems.
In this role, you will...
- Diagnose and resolve issues in Kubernetes clusters, including deployments, pod failures, networking issues, and autoscaling.
- Lead incident management efforts including on-call response, root cause analysis, and continuous improvement of incident playbooks.
- Design and maintain monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
- Set up and manage Kibana dashboards and maintain the ELK stack to ensure high availability and performance of logging infrastructure.
- Integrate metrics, logs, and traces into a unified observability platform.
- Build and maintain alerting pipelines that improve the signal-to-noise ratio of production alerts.
- Contribute to infrastructure automation using tools such as Terraform and Helm.
- Set up and support CI/CD pipelines for automated testing, deployment, and rollback across multiple environments.
- Participate in shift rotations and continuously improve observability and response systems.
You've Got What It Takes If You Have...
- 2+ years in an SRE, DevOps, or Infrastructure Engineer role.
- Bachelor's degree in computer science, IT, or related technical field.
- Hands-on experience with AWS and GCP.
- Deep hands-on experience with Kubernetes (EKS, AKS, GKE).
- Strong understanding of Linux internals, container orchestration, and microservice architecture.
- Hands-on experience with monitoring/logging tools:
  - Prometheus, Grafana, InfluxDB
  - ELK stack (Elasticsearch, Logstash, Kibana)
- Proficiency with incident response and alerting tools such as PagerDuty.
- Basic knowledge of:
  - Kafka: topic monitoring, consumer health
  - ElastiCache / Redis: caching patterns and troubleshooting
  - InfluxDB: time-series metrics storage
- Experience writing and maintaining automation scripts in Bash, Python, or Go.