SiteOps Global Infrastructure Services Engineer
Company: Meta Inc
Location: Tallahassee
Posted on: May 26, 2023
Job Description:
Summary:
The Site Operations team is responsible for the delivery of data
center compute and storage at Meta, enabling our family of apps and
services to support a growing global community. We are seeking a
forward-thinking individual skilled across multiple disciplines to
lead global initiatives on this team. The Infrastructure Services
Engineer will take on complex technical problems, delivering
effective and impactful solutions while working and communicating with
distributed teams and key stakeholders across multiple
disciplines. This individual will partner with AI teams across Meta
and influence complex AI infrastructure technical strategy across
the globe and spanning multiple disciplines such as Hardware,
Software/Firmware, Networking and Power & Cooling. This role would
also be responsible for looking at the AI infrastructure strategy
from an operational perspective and providing guidance and
direction. The individual will be able to convey technical AI
details and distill them into a clear, high-level message that is
understood at all levels. Although the focus of this engineer will
be oriented towards AI infrastructure, the expectation would be
that they also be able to leverage their skills across other
infrastructure domains. The person should enjoy working in a
complex, highly technical environment where innovative design,
planning, execution and communication are key to success. The
candidate must be able to work collaboratively with cross-functional
teams to bring innovative infrastructure designs and
initiatives from engineering concept to solution, implementing them
in new and operational data centers across the globe.
Required Skills:
Responsibilities:
Serve as a critical member of the global infrastructure engineering
team supporting and driving the operations of the AI
infrastructure/hardware platforms and associated new technologies
across Site Operations.
Drive complex AI/ML technical solutions globally, spanning
multiple disciplines such as Hardware, Software/Firmware,
Networking and Power & Cooling (all aspects of cooling
solutions).
Work closely with other Engineering team members to share best
practices and ensure appropriate feedback is given to
cross-functional teams in support of AI deployment and
operations.
Work with the AI cluster management team to provide serviceability
feedback on AI/ML production hardware, network, storage, and DC
design impacts.
Influence the higher stack requirements and translate those
requirements into impacts for AI zones (DC planning, buffer
management, regional fluidity IaaS, workload requirements, capacity
management and AI lifecycle).
Represent Site Operations in leading work to define and architect
new solutions on global initiatives by working with key partner
teams across multiple disciplines.
Assemble and lead cognitively diverse teams to address complex
engineering challenges, requiring deep technical expertise as
well as a broad understanding of Meta's overall infrastructure.
Act as a key Subject Matter Expert and mentor in the design,
operation, and troubleshooting of tools, technologies, and
processes utilized within the Site Operations environment.
Understand and assess risks and challenges associated with emerging
hardware, data center and software technologies, and define
plans to address and mitigate them.
Effectively bridge between the logical and physical world, ensuring
a holistic understanding of the full infrastructure stack.
Act as a global communication and advisory point of contact for
the design, implementation and delivery of projects that affect our
global data center and server fleet, and facilitate resolution of
issues, drawing on local expertise and global support partners.
Address issues that are often ambiguous and global in
nature, requiring leadership and collaboration across time zones,
teams and technical domains.
Leverage data-driven methodologies to understand a problem at the
outset, define a plan and measure progress
throughout a project. Provide data-supported narratives and
ensure a strong focus on continuous improvement.
Build and support strong cross-functional connections with teams
across the globe and serve as an advocate for the Site Operations
Team with key partners, influencing policies and procedures to
improve global data center operations.
Ability to travel 20% to 30% of the time required.
Minimum Qualifications:
Experience building globally scalable solutions and translating
global strategic initiatives into local executable projects.
Experience building, operating and scaling with Linux or Unix
operating systems.
BS, BEng or BA in a technical field or commensurate experience.
Understanding of the full stack of infrastructure, with experience
building or operating logical infrastructure on top of a complex,
distributed physical infrastructure.
Knowledge of storage and AI/ML related services and general
knowledge of the hardware that supports them. Experience with
GPU/TPU based platform hardware that operates in AI/ML computing
clusters & workloads. Experience with AI algorithms and knowledge
of systems that can exploit them. Understanding of the workload
characteristics of training and inference engines.
10+ years of technical experience in a large-scale data center or
IT infrastructure environment.
Preferred Qualifications:
Strong knowledge of storage and AI/ML related services and the
hardware that supports them.
Coding or scripting experience in languages such as Go, Bash, PHP,
Python, or SQL.
Strong communication skills and experience working in a highly
distributed environment, across team and department boundaries.
Data Center Design and Expansion. Experience with high-level data
center design, operations, basic electrical/mechanical
infrastructure, and scaling physical infrastructure.
Knowledge of and experience with virtualization, containerization,
distributed systems, fault tolerance, and incident management.
Knowledge of the interdependencies of data center functions and
technologies including electrical, cooling, structured cabling,
security, network, server and storage systems.
Experience in providing technical guidance to external vendors and
partners.
Experience communicating the results of analysis and insights to
cross-functional teams and influencing the strategy of these
teams.
Public Compensation:
$130,998/year to $183,000/year + bonus + equity + benefits
Industry: Internet
Equal Opportunity:
Meta is proud to be an Equal Employment Opportunity and Affirmative
Action employer. We do not discriminate based upon race, religion,
color, national origin, sex (including pregnancy, childbirth,
reproductive health decisions, or related medical conditions),
sexual orientation, gender identity, gender expression, age, status
as a protected veteran, status as an individual with a disability,
genetic information, political views or activity, or other
applicable legally protected characteristics. You may view our
Equal Employment Opportunity notice here. We also consider
qualified applicants with criminal histories, consistent with
applicable federal, state and local law. We may use your
information to maintain the safety and security of Meta, its
employees, and others as required or permitted by law. You may view
Meta's Pay Transparency Policy, Equal Employment Opportunity is the
Law notice, and Notice to Applicants for Employment and Employees
by clicking on their corresponding links. Additionally, Meta
participates in the E-Verify program in certain locations, as
required by law.