
Onsite Interview Site Reliability Engineer, AI & Agentic Systems

ShineBask Technologies LLC

Plano, Texas, United States
Information Technology
Engineering
H-1B
OPT
All other/unspecified
Job posted on June 16, 2026
Job Description:
On-Site Interview: Site Reliability Engineer, AI & Agentic Systems - Plano, TX
Location: Plano, TX
9-month contract
Work Schedule: Hybrid role requiring 3 days onsite per week in Plano, TX.
Candidates must be able to attend an onsite interview in Plano, TX. Please target local candidates in the Dallas/DFW area.

Interview Process: 2 Steps
  • Virtual Interview
  • Onsite Interview in Plano, TX

Work Authorization: Candidates must be authorized to work in the U.S. Accepted statuses: H-1B, EAD-OPT, EAD-CPT, TN Visa, EAD-Green Card, Green Card, or U.S. Citizen.

As our SRE charter continues to evolve, this role demands strong hands-on ownership of production reliability and troubleshooting, coupled with advanced capabilities in AI- and agentic-driven automation and performance engineering. The Site Reliability Engineer will play a critical role in ensuring the reliability, scalability, performance, and operational excellence of our platforms. The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations, while also driving performance testing practices to validate system resilience under load.

A DAY IN THE LIFE
In this role, you will…
  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
  • Contribute to evolving SRE standards, tooling, operational processes, and knowledge base

Reliability Engineering & Production Ownership
  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Define, measure, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets aligned with business outcomes
  • Drive proactive reliability improvements based on operational insights, failure mode analysis, and capacity planning
  • Participate in on-call rotations and take real-time ownership during production incidents

Platform & Automation Engineering
  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
  • Implement self-healing mechanisms, automated remediation workflows, and runbook automation
  • Manage and optimize API lifecycle and traffic management using Gravitee API Gateway
  • Design and implement durable, fault-tolerant workflows and microservice orchestration patterns using Temporal
  • Administer and tune PostgreSQL databases for reliability, performance, and high availability
  • Partner with application and platform teams to improve service operability, deployment safety, and change management

Performance Testing & Load Engineering
  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Build and maintain performance test scripts and virtual user scenarios using Micro Focus LoadRunner and VuGen (Virtual User Generator)
  • Analyze performance test results to identify bottlenecks, regressions, and scalability limits; produce clear reports with actionable recommendations
  • Integrate performance testing into CI/CD pipelines to enable continuous performance validation and shift-left testing practices
  • Establish and monitor performance baselines, benchmarks, and SLAs across critical service endpoints and user journeys
  • Collaborate with development and architecture teams to resolve performance issues and optimize system throughput, latency, and resource utilization

AI / Agentic Engineering (Azure Focus)
  • Design and implement AI-driven and agentic systems to enhance operational workflows and intelligent decision-making
  • Build intelligent automation for operational use cases, including:
      • Incident triage, enrichment, and automated escalation
      • Alert correlation, deduplication, and noise reduction
      • Automated diagnosis and remediation of recurring failures
  • Leverage Azure AI services (Azure OpenAI, Cognitive Services, Azure ML) for operational intelligence and predictive insights
  • Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
  • Ensure safe, reliable, and observable operation of AI-powered systems in production, including guardrails, fallback mechanisms, and audit trails

Collaboration & Technical Leadership
  • Act as a reliability, performance, and automation champion across engineering teams
  • Mentor junior SREs and influence adoption of best practices in reliability, observability, and performance engineering
  • Contribute to evolving SRE standards, tooling, operational processes, and knowledge base
  • Participate in architecture reviews and provide guidance on non-functional requirements (reliability, scalability, performance)

Core SRE Skills
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
  • Strong hands-on experience in production troubleshooting of distributed systems at scale
  • Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
  • Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
  • Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
  • Proficiency in one or more programming languages: Python, Go, Java, or equivalent
  • Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)

Observability & Monitoring
  • Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
  • Experience with alerting strategies, SLI/SLO-based monitoring, and on-call incident management

Performance Testing & Load Engineering
  • Proven experience designing and executing performance and load testing for large-scale distributed applications
  • Hands-on proficiency with Micro Focus LoadRunner and VuGen for scripting virtual user scenarios, parameterization, correlation, and result analysis
  • Strong understanding of performance testing methodologies: load testing, stress testing, endurance/soak testing, spike testing, and capacity planning
  • Ability to analyze performance metrics (throughput, response time, error rate, resource utilization) and translate findings into engineering actions
  • Experience integrating performance tests into automated CI/CD pipelines

Platform & Middleware
  • Experience with Gravitee or equivalent API gateway platforms for traffic management, rate limiting, and API lifecycle governance
  • Hands-on experience with Temporal for workflow orchestration, durable execution, and distributed task management
  • Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning

AI / Agentic Systems
  • Hands-on experience building or integrating AI-powered automation in production environments
  • Experience with agent-based systems, LLM-powered workflows, Retrieval-Augmented Generation (RAG), or intelligent assistants
  • Familiarity with Azure-based AI and ML services (Azure OpenAI, Cognitive Services, Azure ML)
  • Understanding of reliability, safety, observability, and operational challenges of AI systems in production