11 days ago - req19420
Site Reliability Engineer
Research & development
Computer science & software engineering
In a nutshell
ASML is one of the world’s leading manufacturers of semiconductor-chip-making equipment. A majority of the world’s microchips receive their critical lithographic patterning in machines made by ASML. In addition, ASML produces metrology tools and advanced applications to analyze and optimize the performance of the customer’s production process.
Troubleshoot short-term problems and translate them into structural improvements to our distributed data and compute platform infrastructure. Be accurate, be precise, and help drive up the aggregate availability of the installations of these distributed computing systems in Korea, Taiwan, Israel, China, the US and elsewhere. Be part of the compute platform that is one of the main pillars supporting the production of next-generation microchips for Apple, Samsung and many others.
Site Reliability Engineering is a new concept for ASML. You will be breaking new ground. The SRE is expected to work on customer installations worldwide as well as on the test and integration systems running in Veldhoven.
The ‘Site’ where you are expected to drive reliability upwards is one or more of the many installations of the Virtual Computing Platform (VCP) around the world. This platform, still under development, will be the foundation for the applications developed in-house by other teams.
These applications take data from ASML scanners and ASML YieldStar equipment. They combine this data into real-time production corrections and scanner process diagnostics. The corrections are sent back to the ASML production equipment. Failure of the platform would mean failure of the customer’s (TSMC, Samsung, Intel, etc.) production facility.
Hence we have an uptime requirement of four nines (99.99%). As a true distributed computing expert you will have your own view on such a baseline requirement, but that might be a nice topic to discuss during an interview.
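To put that requirement in perspective, four nines leaves a downtime budget of roughly 52.56 minutes per year (525,600 minutes x 0.0001). The short Python sketch below works this out for several levels of "nines". It assumes a 365-day year and availability measured as a fraction of wall-clock time; the posting does not state how or over what window the requirement is measured, and the function names are illustrative only.

```python
# Minimal sketch: convert a "number of nines" into an availability
# fraction and an allowed-downtime budget.
# Assumptions (not from the posting): 365-day year, availability
# measured as a fraction of wall-clock time.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Return the availability fraction, e.g. 4 nines -> 0.9999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_budget_minutes(nines: int,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed within the period, in minutes."""
    return period_minutes * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {availability_from_nines(n):.5%} available, "
              f"{downtime_budget_minutes(n):.2f} min/year of downtime allowed")
```

Passing a different `period_minutes` (for example 30 * 24 * 60 for a 30-day month) gives the budget for a shorter measurement window.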
The Managed Operations (MO) department, active 24/7 in 3 geographical locations (time zones), sits between the customer and the SRE team. As such, monitoring and alert handling are not in the scope of the SRE at ASML at the moment. Where MO cannot address a problem, the SRE steps in to help solve it. It is the task of the SRE team to enable their MO counterparts to handle alerts without escalation, through clear documentation and well-defined automated corrective actions.
A great SRE will take the learnings from an incident to improve the system in a next release. Via automation, automation and automation, plus reduction of moving parts, upgrades of critical components or additional alerting, the SRE tries to bring the number of alerts back to ‘0’. The time that is saved is spent on adding features and capabilities to the platform to further drive the applications roadmap of ASML.
Responsibilities of the SRE:
-Create awareness in other teams of the methods and procedures we use, to help them prevent repetitive help requests.
-Help application developers to understand the infrastructure / cluster / system
-“We are the team that is in charge of understanding & explaining how the system fits into the customer’s ecosystem”
-Share knowledge / mindset with other teams (dev/infra engineers)
-Work cross-functionally, share knowledge between infra engineers
-Contribute towards building VCP as a Product which meets ASML standards of quality
-Increase stability and reliability of VCP by automated testing and automation
-Customer satisfaction and product reliability
-Improve the functionality and reliability of VCP
-Translate customer ecosystem needs to engineering deliverables
-Find the broken pieces of the puzzle at system/cluster level
-Combine individual ‘stories’ into a complete book
-Make the VCP reliable by improving system resilience (bug-fixing and beyond)
-Resolve bugs in a sustainable way (implement regression tests, design structural fixes)
-Ambassador of predictable component lifecycle management
-Technical roadmap maintenance (App life cycle management)
-Support feature and service requests from the field
-Suggest improvements to our technical solutions and way of working, and implement them in alignment with your team and their stakeholders
Bachelor's or Master's degree in Computer Science
Highly valued qualifications & experiences:
-Experience with DC/OS
-Experience with introducing new technology with zero downtime, including data migration
-Fan of automated testing and qualification, where possible as part of a CI/CD pipeline
-Affinity for digging deep into the details of networking issues
-Available to work (remotely) outside regular office hours when it turns out that our attempt to build a fail-safe system was not yet successful. We really want this to be the exception, not the rule
Required qualifications & experiences
-Knowledge of distributed computing systems, with practical experience (must!)
-Experience with build and release infrastructure: Maven, Nexus, Bamboo, GitHub
-Familiar with at least one scripting language (Python)
-Experience with Ansible
-Problem solving / Go-fix mentality
-No is not an answer / Open to Challenges
-Think out of the box
-Look through the customer's eyes
-Collaboration with stakeholders
-Curiosity to understand how the system works
-Broad obsession with, e.g., Java, Python, APIs, Ansible
-Ability to dive deep into a specific topic
-Drive to build a more secure, faster, more reliable VCP
-Keep in mind that we are not Netflix: we tend to choose proven technology over the latest and greatest in order to keep meeting the four nines
-Think logically and use that ability to solve problems
-Be able to combine the individual elements and requests into a system design
-Share knowledge, work in pairs
-End-to-end knowledge for VCP support (skillset)
-Operations / support mindset
Context of the position
You will be working at Business Line Applications (BL Apps). BL Apps develops Analytics & Control solutions that improve the accuracy of performance metrics (such as overlay, focus, critical dimension) as measured on the end product of a fab process (wafers with chip structures). The foundation underneath these processing algorithms is a distributed computing platform assembled in-house and, as mentioned, installed at ASML customers worldwide.
There are 3-4 infra teams, with 20-30 engineers, Product Owners and Scrum Masters, working on the platform layers.
There are 15-25 application development teams building the business-critical applications.
Keywords: Ansible, Kubernetes, DC/OS, D2IQ, Mesosphere, HDFS, MongoDB, Docker, UCR, Spring Boot, Splunk, Linux, HDP, Bamboo, Nexus, JIRA, Scrum, RHEV, RHEL, BACKUP