Tracxn - Technology - Sr Site Reliability Engineer (Sr SRE) (2-6 Yrs)
Posted by tamal.chakraborty (@tamalchakraborty)
Tracxn is looking for experienced and motivated professionals to play a vital role in developing, scaling, and automating the IT infrastructure. As a Senior SRE, you will get hands-on experience with the latest technologies and tools, such as Ansible, AWS, Docker, HAProxy, Kafka, ZooKeeper, Mongo, MySQL, Redis, the ELK Stack, shell scripting, Java, and Python.
The incumbent in this role would demonstrate a strong focus on infrastructure management and tactical operations, as well as large-scale production engineering and orchestration.
What we are looking for:
Experience with IaC tools (one of Ansible, Puppet, Chef, etc.)
Experience with at least one of the cloud service providers (AWS, Azure, GCP)
Knows their way around Linux-based operating systems
Expertise in at least one scripting/programming language
Experience with version control tools like Git
Experience in configuring and managing enterprise monitoring and resource tracking systems
Experience with containers and orchestration (Docker, Kubernetes)
Experience in infrastructure and configuration automation (Terraform, SaltStack)
Understanding of protocols/technologies like HTTP, SSL, LDAP, SSH, SAML, etc.
Systems fluency (Linux, storage, networking)
Experience with modern software components (Mongo, Redis, Elasticsearch, Kafka)
In-depth knowledge of operating systems (processes, threads, concurrency issues, locks, mutexes, semaphores, monitors and how they work)
Experience with software automation systems in production (like Jenkins)
Expertise in software development methodologies
Responsibilities:
Build and manage application deployment modules and pipelines
Design, develop, secure, and optimize our AWS infrastructure, and keep up to date with the latest features and releases from AWS.
Develop and manage the infrastructure as code using Ansible
Run the production environment by monitoring availability and taking a holistic view of system health
Design metric collection and monitoring strategies and manage the alerting.
Drive RCA (Root Cause Analysis) for high-priority incidents and work with respective development teams on preventive measures.
Automate detection and resolution of recurring issues in the production environment and work towards building self-healing systems.
Please note: Candidates should be ready to work remotely until the pandemic subsides.
It is NOT OK for recruiters, HR consultants, and other intermediaries to contact this employer