DevOps / SRE
Platform.sh is a remote-first global workforce building a better cloud platform to create, manage and responsibly scale web applications.
As a collective with diverse backgrounds, we work together to test, innovate, and challenge one another, finding new ways to reimagine digital experiences. We’re here to help our customers thrive.
Bring your experience to our team, and help us build a better way.
We're looking for an Operations and Service Reliability Engineer with a taste for Python and Go, great Linux system understanding, and a real hunger for the challenges of building robust, distributed systems.
Platform.sh is a PaaS shrouded in a lot of black magic (we can consistently clone a whole running cluster, with its state, databases, and indexes, in a matter of seconds). We want to get this down to hundreds of milliseconds. Interested? There is more...
Our external API is pure Hypermedia REST + OAuth on top of Pyramid. It mechanizes the Git layer and needs more features.
From the same manifest, we can consistently generate a Docker container, an LXC container, or VM disk images (AWS, Azure, OpenStack). We want more targets.
We probably have the highest container density in the industry. We need to get it higher.
We support Python, Ruby, Node.js, PHP, Java, and .NET. Time to roll out Elixir, of course (and Rust. We need Rust).
Directly reporting to one of our Directors for the Operations Infrastructure Department and in close interaction with our Engineering and Customer Success teams, you will be responsible for:
- Cloud operations: configure clusters, deploy stuff, follow up on alerts, help customer support debug issues
- Automating all of the above so the team can drink margaritas instead (or non-alcoholic beverages, of course)
- Creating systems, tools & processes that will enhance our support and operations efficiency
- Improving service quality, discipline, and reliability throughout the lifecycle
- Monitoring operating objectives, and streamlining and automating intervention
- Continuous learning from Operations experience, modeled as software
What you can expect:
- Development week: writing the tools and automation to make our infrastructure run smoothly using Puppet, Go, Python, and more. We often find ourselves working on monitoring, self-healing, and upgrades.
- Deploy week: there’s a constant need for increased capacity, and so we also handle the creation of new infrastructure, wherever and whenever needed.
- Escalation week: when there's a problem too tough for our support team to solve, our team is called upon to provide assistance.
- On-Call week: we dedicate one team member at a time to handle critical monitoring alerts from the infrastructure.
What we're looking for:
- Proven successful experience in an operations role
- The ability to successfully manage cloud-based infrastructure for a fast growing organization
- Experience with containerization technologies
- Previous exposure to cloud services such as AWS, Azure, GCP, etc.
- An understanding of how an OS works, how networking works, how Git works, and the constraints of a distributed system
- Puppet experience
- Proficiency in Python (Golang a plus)
Nice to have:
- Knowledge of Magento Ecommerce, Symfony, Drupal, eZ Platform, or TYPO3
- Ability to cover weekends