The Scientific Computing Core (SCC) is the technical backbone of the Flatiron Institute. SCC develops, deploys and maintains computational infrastructure from supercomputers to desktop workstations and provides resources, services, and expertise to the Institute's research community.
As a member of the SCC team, the HPC Systems Engineer is responsible for the deployment, operation and maintenance of the Flatiron Institute's scientific computing infrastructure. Responsibilities include hardware monitoring and replacement for large-scale HPC clusters with high-performance interconnects such as InfiniBand and Ethernet. Under the direction of the SCC Director, they work with complex systems and networks in production and research environments, actively monitor clusters to ensure high system availability, and apply standard methods and procedures to resolve systems problems of moderate scope where analysis of situations or data requires a review of a variety of factors.
This is a full-time position based in the Simons Foundation's offices in New York City and its data center in Secaucus, NJ.
Visit the Simons Foundation career page at simonsfoundation.org/careers to learn more.
Support and troubleshoot cluster compute and network hardware for large Linux-based HPC systems connected via Ethernet, InfiniBand or other networks.
Perform basic networking tasks to interconnect servers or cluster components.
Monitor the state of HPC systems and initiate appropriate actions to maintain high availability of resources.
Fulfill technology support requests from Flatiron Institute staff.
Manage the issue resolution cycle, including creation and tracking of tickets, communication with vendors, and generation of summary reports.
Work with hardware vendors to resolve hardware issues.
Provide guidance and assistance to the SCC team on escalated requests and contact vendors for support as necessary.
Assist the SCC Director with department reporting and operations.
Keep up to date with data center technologies.
Use Bright Cluster Manager and other configuration management systems to maintain the configuration state of clusters.
Track all configuration changes with Git or a similar version control system.
Write and execute shell scripts in support of systems management, log analysis and other system administration duties for multiple systems. Use DevOps tools to document these changes and commit them to revision control systems.
Travel to our data center in Secaucus, NJ is required 1-2 times per week.
Bachelor's degree in Information Technology, Computer Science or a related field, or equivalent professional experience and training.
Experience troubleshooting and repairing HPC and data center hardware, including compute and storage servers and network switches.
Knowledge of structured cabling components and concepts (patch panels, fiber, copper).
Experience with data center facilities components (rack power distribution units, UPS, single- and 3-phase power, CDUs).
Basic understanding of Ethernet and IP networking.
Ability to write technical documentation in a clear and concise manner.
Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
Knowledge of the design, development and application of technology and systems to meet business needs.
General knowledge of other areas of IT.
Understanding of how system management actions affect users and dependent / related functions. Particular understanding of how these actions affect multi-tenant and multi-node environments that are characteristic of HPC systems.
Familiarity with Linux systems administration and Linux shell scripting.
Experience using issue-tracking systems to manage and document problem resolution.
Familiarity with cluster management software such as Bright Cluster Manager preferred.
Related Skills & Other Requirements
Detail-oriented and responsive to requests
Patient, friendly and willing to find reasonable solutions
Excellent interpersonal, verbal and written communication
Self-motivated, able to work independently and as part of a team, with demonstrated problem-solving skills and the ability to learn effectively and meet deadlines.
Must be able to work in a data center, moving and working with rack-mounted machines.
COMPENSATION AND BENEFITS
The full-time annual compensation range for this position is $90,000 - $120,000, depending on experience.
In addition to competitive salaries, the Simons Foundation provides employees with an outstanding benefits package.
THE SIMONS FOUNDATION'S DIVERSITY COMMITMENT
Many of the greatest ideas and discoveries come from a diverse mix of minds, backgrounds and experiences, and we are committed to cultivating an inclusive work environment. The Simons Foundation actively seeks a diverse applicant pool and encourages candidates of all backgrounds to apply. We provide equal opportunities to all employees and applicants for employment without regard to race, religion, color, age, sex, national origin, sexual orientation, gender identity, genetic disposition, neurodiversity, disability, veteran status, or any other protected category under federal, state and local law.
To apply, visit: https://simonsfoundation.wd1.myworkdayjobs.com/en-US/simonsfoundationcareers/job/HPC-Systems-Engineer_R0001389-1