Lead Infrastructure DevOps Engineer

tombola


Date: 3 days ago
City: Sunderland, England
Contract type: Full time
Lead DevOps Engineer

Ready to lead the charge in seamless software delivery? Find your bounce at tombola!

At tombola, we build our own amazing games and platforms, and getting that cutting-edge software to our players reliably and efficiently is key. We're looking for a Lead DevOps Engineer to head up a team/function that designs, implements, and maintains the infrastructure, automation, and deployment processes that power our high-quality software delivery. You'll help bridge the gap between development and operations, ensuring scalable, reliable, and secure systems that always align with our business goals.

What will you be doing?

As a Lead DevOps Engineer, you'll be instrumental in shaping our infrastructure and delivery pipelines. You'll guide a team, driving best practices in automation, monitoring, and cloud management to keep tombola at the forefront of online gaming.

Key Accountabilities and Responsibilities:

Team Leadership and Management

  • Provide leadership, management, and development for your direct reports.
  • Achieve this through effective 1-to-1s, clear objective setting (OKRs), and performance management.
  • Make team goals clear and ensure they align with our broader business objectives.
  • Collaborate with other teams and departments to achieve shared success.
  • Work with our People Partner for tech to build robust team management practices.

Continuous Integration and Continuous Deployment (CI/CD)

  • Develop and maintain CI/CD pipelines: Automate software integration, testing, and deployment to speed up delivery.
  • Integrate various tools: Ensure the development process integrates seamlessly with build and deployment tools (e.g., Octopus Deploy, GitHub, TeamCity).
  • Automate deployment processes: Make sure deployments are fully automated and can be performed with minimal manual intervention.

Infrastructure Management

  • Provision and manage infrastructure: Use Infrastructure as Code (IaC) tools like Terraform or CloudFormation to provision and manage infrastructure in AWS cloud environments.
  • Optimize resource usage: Ensure our infrastructure runs efficiently in terms of cost, resource allocation, and performance.
  • Scalability and availability: Ensure systems and applications are scalable and highly available by setting up robust monitoring, scaling, and failover mechanisms.

Monitoring and Incident Management

  • Set up monitoring tools: Implement tools like CloudWatch and Dynatrace to monitor system health, performance, and availability.
  • Respond to incidents: Be part of the team that responds quickly to system outages or issues, performing root cause analysis and implementing fixes or improvements.
  • Log management: Ensure logging is properly set up using tools like Fluentd and Kibana, and use logs for troubleshooting and improving system reliability.

Automation

  • Automate manual tasks: Identify and automate repetitive tasks like environment setup, configuration management, and updates.
  • Script development: Write scripts to automate common operations, reducing manual intervention and potential errors.

Collaboration with Development and Operations Teams

  • Foster collaboration: Work closely with developers, system administrators, and other teams to align goals and requirements, ensuring seamless development and deployment processes.
  • Code reviews and quality: Collaborate with developers to ensure code meets operational standards and can be deployed reliably.

Security and Compliance

  • Ensure security in the pipeline: Integrate security practices into the development pipeline (DevSecOps), ensuring vulnerabilities are identified early.
  • Maintain compliance: Ensure infrastructure and processes comply with industry regulations and standards (e.g., GDPR, ISO, SOC 2).

Cloud Management

  • Cloud architecture and management: Design, implement, and maintain infrastructure on AWS.
  • Cost management: Monitor cloud costs and optimize resource utilization to control run-rate.

Performance and Reliability Optimization

  • Ensure optimal performance: Continuously assess and optimize system performance to handle load efficiently and minimize downtime.
  • Disaster recovery planning: Develop and test disaster recovery plans to ensure business continuity.

Version Control and Configuration Management

  • Version control systems: Use tools like Git to manage code changes and ensure proper branching, merging, and versioning.
  • Configuration management: Use tools like AWS Control Tower, AWS Systems Manager (SSM), and AWS Config to maintain consistent environments across development, staging, and production.

Documentation

  • Maintain clear documentation: Document infrastructure, deployment processes, and operational procedures to ensure knowledge sharing across the team.

Performance Metrics and Reporting

  • Collect and analyze metrics: Gather system metrics and provide regular reports on system performance, uptime, and resource usage to stakeholders.
  • Optimization recommendations: Based on the metrics, suggest improvements to optimize the performance and cost-effectiveness of the system.

Continuous Improvement

  • Stay up to date with industry trends: Regularly evaluate new tools, technologies, and practices that could improve our development and operations processes.
  • Implement best practices: Apply best practices for automation, system design, and security to improve the reliability, scalability, and efficiency of the infrastructure.