SRE Engineer
■ Your Role and Responsibilities
• Use your on-call shift to prevent incidents from happening.
• Document actions taken so your findings turn into repeatable actions, and then into automation.
• Design, build and maintain core infrastructure pieces that support hundreds of thousands of concurrent users.
• Debug production issues across services and levels of the stack.
• Mentor interns and intermediate SREs in all areas, and other SREs in your area of deep knowledge.
• Contribute improvements to the codebase to resolve issues.
• Identify significant projects that result in substantial cost savings or revenue.
• Identify product architecture changes that improve reliability, performance and availability, using a data-driven approach.
• Proactively plan for efficiency and capacity to set clear requirements and reduce system resource usage, making the company's assets cheaper to run for all our customers.
• Identify parts of the system that do not scale, apply immediate palliative measures and drive long-term resolution of these incidents.
• Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
• Know a domain really well and radiate that knowledge through recorded demos, discussions in DNA meetings, or Incident Reviews.
• Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that can prevent the incident from ever happening again.
• Set an example for the team of SREs through positive and inclusive leadership and discussion of work.
• Be able to de-escalate conflicts inside the team.
■ Work Location
• Tokyo, Japan
■ Experience and Qualifications
• 2+ years of work experience in the IT sector
• Experience as a Cloud, DevOps or Reliability engineer
• Experience working closely with engineering teams to create and improve containerized technologies
• Able to collaborate in a global team environment, actively engage subject matter experts, and follow through on commitments
• Strong problem-solving (debugging) skills: the ability to dissect, divide and conquer platform problems and find the root cause
• Knowledge of Microsoft Azure, AWS and/or GCP is a must (Azure preferred)
• Scripting knowledge in PowerShell and/or Python
• Version control experience (Git)
• Knowledge of container orchestration technologies (Kubernetes)
• Knowledge of container technologies (Docker)
• CI/CD knowledge
■ Additional Preferred Qualifications
• Knowledge of Azure Security Center, policy and initiatives
• Knowledge of Azure Sentinel is nice to have
• Experience operating in a Linux environment using the command line
• Experience using Azure DevOps or similar project management & pipeline tools
• Experience managing logging and monitoring systems; knowledge of Azure Monitor