Service Reliability Engineer for Cloud Commerce Operating System (m/f/d)

About us

Spryker Systems GmbH is a fast-growing technology company that offers leading manufacturers, brands, and sellers in all industries a flexible commerce solution across all customer-facing touchpoints, from online shop and mobile to voice, chatbot, blockchain, and IoT use cases. Our modern offices are located in the German digital metropolises of Berlin and Hamburg.
The international Spryker team constantly works with exciting new customers, technologies, and innovative approaches, and is looking for talented people to join us in revolutionizing the digital commerce world.

In a Nutshell

Are you an experienced Service Reliability Engineer with strong ownership skills? Do you think that cloud-native is not just technology, but a mindset? Do you want to put the latest technologies to use for hundreds of customers in different industries?

Join us as a Service Reliability Engineer to help us build the next generation of cloud and composition platforms to revolutionize the world of transactional business models.

We are open-minded, pragmatic, and agile above all. If you think you have the same attitude, join our Spryker Technology Team and help us to revolutionize the world of commerce.

Your challenges

  • Respond to live production incidents according to the established on-call schedule
  • Communicate the status, progress, and forecast for resolving acute problems to incident managers, customers, and other stakeholders
  • Communicate with the product development team to produce requirements for infrastructure, networking, and operations toolsets necessary to provision and maintain product lines
  • Solve day-to-day operational problems in the production environment
  • Monitor and ensure compliance with SLAs, SLOs, RPOs, and RTOs
  • Automate common tasks and processes
  • Write documentation, articles, and How-Tos
  • Analyze problems across the full stack, from the virtual hardware to specific applications: read and analyze logs and metrics to identify root causes and resolve them permanently or with a workaround
  • Ensure a robust, stable, and secure back-end infrastructure to support the product portfolio
  • Build deep monitoring coverage, implementing inside-out, outside-in, and machine learning-based monitors that enable early discovery and auto-resolution, moving the system toward a 99.999% SLA
  • Design, build, and release platform updates; strive for full automation, regression detection, etc.
  • Stay up to date with industry trends

Your profile

  • Degree in Computer Science or Software Engineering, or equivalent experience
  • Customer-obsessed mindset
  • Extensive experience with AWS or other major public cloud platforms
  • Experience with and willingness to participate in 24/7 on-call duty as part of the team; sharp thinking and troubleshooting skills, even during critical incidents
  • Experience working with high-scale, complex, cloud-based production environments
  • Experience with incident management and full-stack problem analysis and resolution
  • Experience with infrastructure-as-code and configuration management tools such as Terraform and Ansible
  • Good knowledge of production monitoring and APM tools such as New Relic, Blackfire, or Tideways
  • Experience in automation and willingness to automate routine tasks
  • Excellent communication skills, both internally and externally
  • Experience writing technical articles, How-Tos, and customer-facing communications
  • Experience working with Git, including branching strategies such as git-flow
  • Basic knowledge of relational database management systems
  • Upper-intermediate English