Manager, Incident Response
Summary:
The Manager of Incident Response is an experienced and dynamic leader who will manage the Gurugram incident response team and oversee the seamless operation, performance monitoring, and site reliability of Cosm venues, B2B customer installations, live video productions, and Cosm's cloud-based applications and services. The ideal candidate will possess a strong background in operations management coupled with a deep understanding of on-premises and cloud-based performance monitoring tools and strategies. This role requires strategic thinking, leadership, and hands-on technical expertise to ensure the reliability, scalability, and optimal performance of all Cosm technologies and deliver a seamless experience for our customers around the globe. This manager will lead the local incident response team, participate in monitoring shifts, drive operational excellence, and contribute to the development of incident response strategies and frameworks.
Responsibilities:
- Oversee and manage the incident response operations within the Gurugram team, ensuring prompt resolution of high-impact incidents and escalations.
- Lead and support the local incident response team through effective incident diagnosis, prioritization, and documentation.
- Participate in monitoring shifts for all venue products and systems, ensuring the health and performance of our customer-facing technologies.
- Serve as the primary point of contact for high-severity incidents and escalations, coordinating with global teams as necessary.
- Collaborate with engineering and cross-functional teams to implement and follow up on incident remediation efforts and RCAs.
- Support the DevOps lifecycle by providing product data to product owners and engineering teams, giving them the information they need to enhance product reliability and serviceability.
- Utilize industry-leading tools and methodologies to monitor, analyze, and report on infrastructure, service, product, and application performance, identifying and addressing potential bottlenecks or issues before they impact end-users.
- Drive the implementation and adoption of best practices for infrastructure, systems, live video production, and cloud-based application performance monitoring, ensuring alignment with organizational goals and industry standards.
- Develop and deliver comprehensive incident and operational reports to stakeholders, highlighting key metrics and areas for improvement.
- Coordinate with various teams on venue upgrades and planned outages to ensure seamless execution.
- Work with Customer Service teams to address incidents affecting the customer experience and collaborate with engineering teams on investigation and remediation efforts.
- Mentor and guide team members, fostering a collaborative and high-performance work environment.
- Drive continuous improvement in incident response processes, tools, and procedures, leveraging feedback from team members and field services.
- Provide support during nights and weekends as needed for high-priority business events.
Experience:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 8+ years of experience in an incident management, operations center, or similar leadership role.
- Proven expertise in incident management tools and systems (e.g., Grafana, ServiceNow).
- In-depth knowledge of cloud technologies (e.g., AWS) and experience with cloud-based monitoring tools (e.g., CloudWatch, Prometheus, Grafana).
- Strong understanding of DevOps practices, automation, and CI/CD pipelines.
- Proficiency in analyzing performance data, troubleshooting issues, and collaborating on solutions to optimize infrastructure, systems, or application performance.
- Experience in crafting PromQL and LogQL queries for incident investigation, dashboard creation, and alert tuning.
- Experience in managing and supporting complex infrastructure and SaaS applications.
- Strong analytical, communication, and problem-solving skills.
- Demonstrated ability to lead teams effectively, with experience in high-pressure situations and incident management.
- Knowledge of ITIL or similar incident/service management frameworks.
- Previous experience managing a 24/7 operations center or equivalent environment.
- Certifications in project management, cloud platforms, information security, networking, and application performance monitoring tools are a plus.
Work Environment:
- Flexibility to work overtime and weekends as required by the operational needs of the business.
- Availability for on-call rotation, including nights and weekends, to support critical business events.
Other details
- Job group: India
- Pay type: Salary
- Gurgaon, Haryana, India