Engineering
SENIOR SITE RELIABILITY ENGINEER
INDIA, NEW YORK, MANCHESTER
Full-Time, Part-Time, Internship, Remote
We’re constantly working towards making BABVIP the best place to work, for everyone. We believe deeply that bringing together diversity of thoughts, perspectives and expression is key to building the best product for our equally diverse community all around the world. We celebrate uniqueness and whatever makes you, you, and we encourage everyone who wants to help us transform the way the world designs to join us on this journey. We value all different types of experiences. If you don’t think you quite meet all of the qualifications, we’d still love to hear from you.
ABOUT US
At BABVIP, our mission is to democratize design and empower creativity for anyone and everyone, on every platform. Inspired by a team of talented thinkers, an amazing culture and a remarkable growth trajectory – we’re out to change the world, one design at a time.
Since launching in August 2013, we have grown exponentially, amassing over 60 million monthly active users across 190 countries who have created more than 6 billion designs. We are one of the world’s fastest-growing technology companies, and we have only achieved about 1% of what we want to do.
You'll be joining BABVIP as one of the founding members of the Reliability Team, a team that sits in the Core Platform Group and Infrastructure Super group. The Reliability Team is responsible for ensuring that all of the resiliency measures that have been implemented work as expected, discovering gaps and working with the teams on fixing them, and developing processes, tools, automation, and libraries that help ensure the reliability of BABVIP.
This role is based in our Sydney office. However, it is remote-friendly for applicants physically based anywhere in Australia or in New Zealand.
WHAT YOU WILL DO
- As an individual contributor, design and implement processes, tools, automation, and libraries that service teams can use to improve the reliability of the services they own, for instance by adding a long-awaited feature to our circuit breaker library.
- Introduce chaos engineering to BABVIP and conduct experiments to identify scenarios in which cascading failure might occur and to verify that the reliability measures we introduce work as expected. For example: What happens when a newly introduced service goes down? Does the fallback for a rare failure actually work?
- Work with product engineering teams to ensure reliability best practices and tools are rolled out in every service across the whole organization. It’s not enough to create a new throttling library; we want to make sure it’s successfully used in every service.
- Foster a culture within the Engineering org that puts reliability first, and establish processes and policies that drive reliability within product engineering teams. This includes things like SLAs, error budgets, on-call response, incident resolution, and observability best practices.
- Investigate production incidents in depth and apply what you learn to the code.
- Research, develop and justify the best choices, in the form of design docs, for tools and processes that will shape the future of reliability at BABVIP.
- Propose new approaches and solutions to ensure we future-proof BABVIP’s distributed cloud infrastructure as we scale.
- Participate in design meetings, hiring interviews, and code reviews.
WE’D LOVE IT IF YOU HAVE
- Bachelor's degree in Computer Science, a related technical field involving software or systems engineering, or equivalent practical experience.
- Previous experience working as a reliability/chaos engineer and/or strong knowledge of Google SRE.
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and ability to debug, optimize code, and automate routine tasks.
- Solid understanding of resiliency techniques and patterns: load balancing, throttling, back pressure, circuit breaking, etc.
- Experience working with microservice architectures in large distributed cloud environments (ideally AWS). We’re hosted on AWS and leverage the tools they provide as much as possible.
- Experience with RPC frameworks such as Finagle, Thrift or gRPC. Understanding how services communicate with each other is crucial to finding out where a failure can occur.
- Knowledge of networking protocols such as TCP, HTTP/2, WebSockets, etc. The life of a request doesn’t start inside the backend web server, but rather in the browser of a user.
- Disciplined coding practices, experience with code reviews and pull requests and a creative and conceptual problem-solving approach.
- Strong communication and team collaboration skills, both written and verbal. As a reliability engineer, you will need to share the knowledge, communicate and coordinate changes across multiple service teams.
- Nice to have
- Experience working with a mainstream programming language such as Java or Python. Our services and libraries are primarily written in Java 13, so Java experience is especially welcome.
- Five-plus (5+) years of commercial experience developing complex, distributed web applications.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- Understanding of Unix/Linux operating systems.
PERKS & BENEFITS
- Competitive salary, plus equity options
- Flexible daily working hours and openness to remote work; we value work-life balance
- In-house chefs that cook delicious breakfast and lunch for us each day
- Onsite Gym; Yoga Benefits
- Generous parental (including secondary) leave policy
- Pet-friendly offices
- Sponsored social clubs and team events
- Relocation budget for interstate or overseas individuals who legally qualify for visa sponsorship
We make hiring decisions based on your experience, skills and passion. If you’re keen to apply and need reasonable adjustments or would like to note which pronouns you use at any point in the application or interview process, please let us know.