The Engineer – SRE (Site Reliability Engineering) is responsible for helping set up observability and alerting to ensure NOC IT systems are up and running 24/7. The role does this by developing tools and automating processes for AWS Cloud services and infrastructure, in collaboration with other leaders and architects.
Job Responsibilities:
System Health and Performance Monitoring:
Regularly review system health metrics and logs to proactively identify potential issues.
Assist in the development and maintenance of performance benchmarks for NOC IT systems.
Continuous Improvement:
Participate in post-incident reviews to identify areas for improvement and implement changes.
Contribute to the development of best practices for system reliability and performance.
Documentation and Reporting:
Maintain accurate documentation of system configurations, processes, and incident resolutions.
Prepare regular reports on system performance and reliability metrics for management review.
Collaboration and Communication:
Work closely with other engineering teams to ensure seamless integration and operation of IT systems.
Communicate effectively with stakeholders to provide updates on system status and ongoing projects.
Minimum Requirements:
Bachelor’s Degree in a quantitative field (statistics, engineering, business analytics, information systems, aviation management or related degree)
2+ years of related experience or successful completion of United DT Early Career Digital Leadership Program (ECDLP)
Experience working with application performance monitoring, CI/CD pipelines, and AWS cloud infrastructure, especially EKS and ECS
Ability to communicate complex quantitative concepts in a clear, precise and actionable manner
Strong writing, communication, and presentation skills
Proven proficiency with Microsoft Excel and PowerPoint
This position does not offer sponsorship for employment-based visas (such as H-1B or STEM/OPT). Candidates must be authorized to work in the United States without requiring visa sponsorship.
Preferred Qualifications:
Master’s Degree