
Senior Site Reliability Engineer
- Limassol
- Permanent
- Full-time
- Design, implement, and maintain monitoring, alerting, and logging systems (Prometheus, VictoriaMetrics, Grafana, OpenSearch, Dynatrace).
- Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage error budgets to measure and improve system reliability.
- Automate repetitive tasks and build self-healing infrastructure using scripting (Bash, Python) and infrastructure-as-code tools (Terraform, Terragrunt).
- Ensure Kubernetes (EKS) cluster reliability through health checks, graceful shutdowns, rolling updates, and autoscaling.
- Develop and maintain CI/CD pipelines using GitLab and Helm charts.
- Lead incident response, conduct blameless postmortems, and implement preventive measures.
- Document operational procedures, runbooks, and observability logic; train internal teams on best practices.
- Participate in 24/7 on-call rotations to maintain service availability.
- 3+ years of experience working with Linux and AWS environments (AWS certifications a plus).
- Hands-on experience with observability tools: Prometheus, Grafana, OpenSearch/ELK, VictoriaMetrics, Dynatrace.
- Familiarity with messaging and database technologies such as Kafka, RabbitMQ, PostgreSQL, Cassandra, Redis, Elasticsearch.
- Strong skills in containerization and orchestration: Docker, Kubernetes (EKS), Helm.
- Proficiency in scripting with Bash and Python; experience with Terraform and Terragrunt for infrastructure automation.
- Solid understanding of CI/CD processes, preferably with GitLab.
- Knowledge of SRE principles including SLIs/SLOs, error budgets, capacity planning, and incident management.
- Excellent communication skills and ability to collaborate across teams.
- English proficiency at intermediate level or higher.