Site Reliability Engineer (SRE)
Company: Procurement Sciences
Location: Washington
Posted on: February 1, 2025
Job Description:
Company Overview: Procurement Sciences AI is at the vanguard
of generative artificial intelligence, transforming the government
contracting sector as a Series A rocketship, proudly backed by
Battery Ventures, a leading, top 1% global technology venture
capital firm. As a venture-backed B2B SaaS entity, we are dedicated
to revolutionizing federal, state, and local business approaches to
government contracting with disruptive AI capabilities. Our team is
committed to addressing customer pain points through an AI-first
strategy, ensuring our solutions are effective and ahead of the
curve. Our flagship platform, celebrated for its "Win More Bids"
value proposition, enhances revenue streams for our clients while
driving unparalleled operational efficiencies. By harnessing the
power of generative AI, tailored for the government contracting
domain, we offer a unique competitive advantage. Our collaboration
with Battery Ventures provides the resources and support to rapidly
scale our innovations, redefining success standards and promising a
quantum leap in value generation and operational excellence for our
clients.
Job Description: We are looking for an experienced,
tenacious, and driven Site Reliability Engineer (SRE) to join our
team. The ideal candidate will be responsible for ensuring the
reliability, performance, and scalability of our systems. This role
will focus on performing root cause analysis, designing and
implementing automated testing, monitoring key service level
indicators (SLIs), and ensuring adherence to service level
agreements (SLAs) and service level objectives (SLOs). The
successful candidate will have a strong background in Kubernetes,
Helm, observability platforms, and cloud providers such as
Azure.
Key Responsibilities:
- Perform root cause analysis to identify and resolve system or
application issues in a timely and effective manner, often in
communication with developers.
- Design and implement a broad range of automated tests to ensure
system reliability and performance.
- Build scalable and cost-effective observability patterns in
Datadog or other monitoring providers.
- Monitor and analyze SLIs to ensure adherence to SLAs and
SLOs.
- Collaborate with development and operations teams to improve
system reliability and developer experience (DevEx).
- Develop and maintain monitoring and alerting systems to
proactively address issues.
- Implement best practices for incident management and disaster
recovery.
- Respond to and manage incidents, performing post-mortem
analyses to prevent recurrence.
- Plan and implement capacity upgrades, ensuring scalability and
performance.
- Automate repetitive operational tasks and develop tools for
system automation.
- Define, monitor, and manage SLAs, ensuring service levels meet
or exceed expectations.
- Ensure systems comply with security and regulatory
requirements.
- Identify areas for continuous improvement in systems and
processes.
- Create and maintain documentation for systems, processes, and
incident responses.
Technical Requirements:
- Proficient in Kubernetes, Helm, and troubleshooting in secure
environments with limited or no remote access.
- Expertise in observability and monitoring tools such as
Prometheus, Grafana, ELK Stack, or Datadog.
- Experience with cloud providers, particularly Azure and Azure
Gov.
- Strong understanding of microservices architecture, including
Postgres and AI systems.
- Expertise in automated testing frameworks and tools (e.g.,
integration tests, synthetic tests, and load testing).
- Experience with monitoring and analytics tools to track SLIs,
SLAs, and SLOs.
- Excellent problem-solving skills, attention to detail, and a
tenacious attitude.
- Strong communication skills, with the ability to work
effectively in a collaborative environment.
- Proficiency in programming languages such as TypeScript and
Python.
- Strong scripting skills in Bash, PowerShell, or similar
languages.
- Experience with Infrastructure as Code (IaC) tools like Azure
Bicep, AWS CDK, or Terraform.
- Understanding of networking principles and experience with
network troubleshooting.
- Strong communication and collaboration skills, with the ability
to work effectively with both technical and non-technical
personnel.
Preferred Qualifications:
- GovCon experience and/or security clearance.
- Knowledge of GitOps principles. Familiarity with FluxCD.
- Experience designing and building CI/CD pipelines.
- Experience with other cloud providers (AWS, AWS GovCloud) is a
plus.
- Familiarity with security and compliance standards in cloud
environments.
- Prior experience in a similar role within a fast-paced, dynamic
environment.
- Deep experience operationalizing new development workloads
across different teams.
Location: Washington, D.C./Northern Virginia