Jatin Mehrotra

Originally published at dev.classmethod.jp

DevOps Guru: A new way (AIOps) to fix your AWS Infrastructure issues

An application deployed in production always has some kind of monitoring and observability to track its performance, operational issues, latency, downtime, error rates, outages, service disruptions, infrastructure code, config changes, and the list goes on.

Often, monitoring and observability do not lead to an efficient outcome because there is a lot of noise, and real issues get missed (false negatives) due to lack of information, alarm fatigue and so on.

For a DevOps engineer or SRE, this means spending days of time and effort to detect, debug and resolve operational issues, which is bad news for a business-critical application and for its customers.

DevOps Guru is an ML-powered service that addresses the above-mentioned issues and saves you precious hours of detecting and resolving operational issues, leading to better application availability and reliability.

Prerequisites

  • NO ML KNOWLEDGE/EXPERTISE IS REQUIRED. (Trust me when I say this, AWS has made it so easy for us.)

What is DevOps Guru?


  • A service that sends you alerts and recommendations whenever there is an anomaly in the operating patterns that can cause application downtime or service disruptions.
  • DevOps Guru uses ML models informed by more than 20 years of operational expertise in building, scaling, and maintaining highly available applications for Amazon.com. This produces Reactive and Proactive insights, so you can identify operational issues before they impact your customers.
  • Reactive Insights: Recommendations to address issues that are happening now.
  • Proactive Insights: Recommendations that address issues that DevOps Guru predicts will occur in the future (power of ML).

Why use DevOps Guru?

  • It saves a lot of time, reduces application downtime and ultimately leads to happy customers, because it pinpoints each problem with a summary of why the issue took place and an actionable recommendation for how to fix it.

How does DevOps Guru work?

  • Baseline Creation: It establishes a baseline which is treated as "normal". For this, it consumes metrics like latency, error rates, and request rates.
  • Anomaly Detection: Using a pre-trained ML model, it identifies anomalies against that baseline and generates alerts with recommendations for remediation.
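
To make the "baseline, then anomaly" idea concrete, here is a minimal sketch in Python. It is NOT how DevOps Guru's ML models work internally (AWS does not publish that); it only assumes a simple rule where anything more than three standard deviations away from the mean of past samples counts as an anomaly. The latency numbers and function names are made up for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise 'normal' behaviour as the mean and standard deviation of past samples."""
    return mean(samples), stdev(samples)


def find_anomalies(baseline, new_samples, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return [(i, v) for i, v in enumerate(new_samples) if v != mu]
    return [
        (i, v)
        for i, v in enumerate(new_samples)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical latency readings in milliseconds (illustrative numbers only).
normal_latency_ms = [102, 98, 101, 97, 103, 99, 100, 100]
baseline = build_baseline(normal_latency_ms)

# New readings arrive; the 250 ms spike is far outside the baseline.
anomalies = find_anomalies(baseline, [101, 99, 250, 100])
print(anomalies)
```

A real service has to cope with seasonality, trends and many correlated metrics at once, which is exactly the part DevOps Guru takes off our hands.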

Use Cases for DevOps Guru

insights

How to enable and disable DevOps Guru and estimate cost?

  • To enable, just click Enable. Yes, that's how easy it is. On top of that, you can configure an SNS topic to send alerts and take action :)

enable

  • To disable, just select None under resources to analyse. You can also decide which resources to analyse based on tags, CloudFormation (cFn) stacks, or all resources :)

disable

  • To estimate cost, you can again choose between cFn stacks, tags, or all resources.

cost-estimator
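
If you prefer code over clicking, the same resource selection can be sketched with boto3. This is a rough sketch, not the workshop's method: it assumes boto3 is installed and AWS credentials are configured, and it uses the `update_resource_collection` call of the `devops-guru` client. The helper names `build_resource_collection` and `update_coverage` are my own. The stack name is the one used later in this post, and tag keys have their own required format, so check the API reference before using tags.

```python
def build_resource_collection(stack_names=None, tags=None):
    """Build the ResourceCollection argument for UpdateResourceCollection.

    stack_names: list of CloudFormation stack names to analyse.
    tags: dict mapping a tag key to a list of tag values (hypothetical input shape).
    """
    collection = {}
    if stack_names:
        collection["CloudFormation"] = {"StackNames": list(stack_names)}
    if tags:
        collection["Tags"] = [
            {"AppBoundaryKey": key, "TagValues": list(values)}
            for key, values in tags.items()
        ]
    if not collection:
        raise ValueError("give at least one stack name or tag")
    return collection


def update_coverage(action, collection):
    """Send the change to DevOps Guru. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("devops-guru")
    return client.update_resource_collection(
        Action=action, ResourceCollection=collection
    )


payload = build_resource_collection(stack_names=["myServerless-Stack"])
print(payload)

# Uncomment to call AWS for real:
# update_coverage("ADD", payload)     # start analysing the stack
# update_coverage("REMOVE", payload)  # stop analysing it
```

Only the payload builder runs here; the two calls at the bottom stay commented out so nothing touches your account by accident.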

Demo Time for DevOps Guru

To illustrate the power of DevOps Guru, we will work with an RDS database and a serverless application which are already deployed using this workshop from AWS.

  • Go to AWS Cloud9 to access the environment for this blog.

cloud 9

DevOps Guru for RDS

  • Scenario 1: There is a mission-critical production application which uses an Aurora relational database with the PostgreSQL engine. The application has been updated with a new feature, and users are complaining about slow performance and timeouts.

  • Simulation of Scenario: We will try to increase the read workload on the DB.

PGPASSWORD=${PGPASSWORD3} psql -h $PGHOST3 -U $PGUSER3 -p $PGPORT3 -c "CREATE EXTENSION pg_stat_statements;"

PGPASSWORD=${PGPASSWORD3} pgbench -h $PGHOST3 -U $PGUSER3 -p $PGPORT3 -c 10 -T 1800 -j 10 $PGDATABASE3

  • The pgbench command runs 10 client connections (-c 10) across 10 worker threads (-j 10) for 1800 seconds (-T 1800).

  • After a certain time, we can see insights from DevOps Guru.

  • DevOps Guru Insights: Select the insight and view its details. Here we can see an overview of the insight, information about the metrics, and the graphed anomalies.

anomaly page

aggregated metrics

  • It shows which metric is affected, in this case Average Active Sessions (AAS). DevOps Guru also shows analysis and recommendations which pinpoint the issue.

metric graph

  • It also explains the likely reason behind this anomaly and recommends a fix, which gives better insight into the cause.

index issue

index recommendation

  • The query needs an index on aid, which is not present in the table (we found this out thanks to DevOps Guru). The takeaway here is that a missing index can ruin application performance and keep the entire system busy.
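
To see why a missing index hurts so much, here is a small self-contained sketch. It uses Python's built-in SQLite instead of Aurora PostgreSQL, so the planner output differs from what you would see on RDS, and the `accounts` table is a hypothetical stand-in for the benchmark table. The idea carries over though: without an index on `aid` the database scans every row, and with one it can jump straight to the match.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string (the last column holds the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table with an account id (aid) and a balance, and no index yet.
conn.execute("CREATE TABLE accounts (aid INTEGER, abalance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [(i, 0) for i in range(10_000)]
)

lookup = "SELECT abalance FROM accounts WHERE aid = ?"

# Without an index the planner has to scan the whole table.
plan_without_index = query_plan(conn, lookup, (42,))

# After adding the index, the same query becomes an index search.
conn.execute("CREATE INDEX accounts_aid_idx ON accounts (aid)")
plan_with_index = query_plan(conn, lookup, (42,))

print(plan_without_index)
print(plan_with_index)
conn.close()
```

The first plan reports a scan of the table and the second reports a search using `accounts_aid_idx`. Multiply that full scan by every client hammering the database and you get the kind of load DevOps Guru flagged.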

  • Scenario 2: A team member took a coffee break but forgot to commit or close a session they had opened on RDS, which leaves an exclusive lock on the table.

  • Simulation of the Scenario: We will use multiple sessions competing for the same "locked" record. Due to the ACID properties of transactions, they must wait for each other.

  • The team member's session

cat > workload-lock-txn2.sql << ENDOFFILE

begin;
lock table hr.job_history in exclusive mode;


ENDOFFILE

  • Our workload
cat > workload-lock-txn1.sql << ENDOFFILE

-- Random values per run (pgbench 9.6+ syntax; \setrandom was removed in 9.6).
\set counter random(2, 200)
\set sleeptime random(1, 5)

begin;
select start_date,end_date from hr.job_history where employee_id=5550 FOR UPDATE;
SELECT pg_sleep(:sleeptime);
update hr.job_history set start_date=(select current_date - :counter days), end_date=current_date where employee_id=5550 ;

ENDOFFILE

  • Running the above workload will create a lock condition.
PGPASSWORD=${PGPASSWORD2} pgbench -h $PGHOST2 -U $PGUSER2 -p $PGPORT2 -d $PGDATABASE2 -f workload-lock-txn1.sql -f workload-lock-txn2.sql -c 120 -T 1800 -j 30 -n

  • Just like the previous insight, DevOps Guru is smart enough to find the cause of the anomaly and recommend a fix.

lock issue

lock recommendation
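
The "forgotten transaction" problem is easy to reproduce locally. Below is a minimal sketch using Python's built-in SQLite, which is an assumption for illustration only: SQLite locks the whole database file, while PostgreSQL's LOCK TABLE affects a single table, and the table here is a simplified stand-in for hr.job_history. One session opens an exclusive transaction and never commits, and a second session cannot write until it does.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "hr.db")

# Session 1: the team member who opens a transaction and walks away.
holder = sqlite3.connect(db_path, isolation_level=None)  # manual transaction control
holder.execute("CREATE TABLE job_history (employee_id INTEGER, end_date TEXT)")
holder.execute("INSERT INTO job_history VALUES (5550, '2020-01-01')")
holder.execute("BEGIN EXCLUSIVE")  # lock taken, no COMMIT yet

# Session 2: our workload, which gives up after waiting 0.2 seconds.
waiter = sqlite3.connect(db_path, isolation_level=None, timeout=0.2)
update = "UPDATE job_history SET end_date = '2024-01-01' WHERE employee_id = 5550"
try:
    waiter.execute(update)
    blocked = False
except sqlite3.OperationalError:
    # SQLite reports "database is locked" when the wait times out.
    blocked = True

# The forgotten commit finally happens, and the same statement goes through.
holder.execute("COMMIT")
waiter.execute(update)
end_date = waiter.execute(
    "SELECT end_date FROM job_history WHERE employee_id = 5550"
).fetchone()[0]

print(blocked, end_date)
holder.close()
waiter.close()
```

In the demo above there are 120 clients instead of one waiter, so the waiting sessions pile up quickly, and that pile-up is what surfaces as the anomaly.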

DevOps Guru For Serverless

  • Scenario 1: As business needs changed, the production application which was running on virtual machines has been moved to AWS serverless. The production DynamoDB table is misconfigured: point-in-time recovery is not enabled.

  • DevOps Guru not only sends reactive insights, it also provides proactive insights (these take about 12 hours to appear once it is enabled) about issues we would not otherwise foresee.

  • Check the proactive insights tab, to see the insight recommendation.

proactive recommendation

  • Scenario 2: We will reduce the Read Capacity of the DynamoDB table from 5 to 1 and apply the change via CloudFormation.

misconfig

cd ~/environment/amazon-devopsguru-samples/generate-devopsguru-insights
aws cloudformation update-stack --stack-name myServerless-Stack \
    --template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

  • Now we will send HTTP requests to the API using 4 parallel instances of the Python script. After some time we can observe HTTP 502 errors.

python sendAPIRequest.py & python sendAPIRequest.py & python sendAPIRequest.py & python sendAPIRequest.py

502 errors
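
If you are curious what such a load script could look like, here is a minimal sketch. It is NOT the workshop's sendAPIRequest.py (I have not reproduced that file here); it just assumes we want a few parallel workers sending GET requests and a count of the status codes that come back. The URL is a placeholder, and the `fetch` parameter lets us do a dry run without touching the network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code of a GET request, including error codes such as 502."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def run_load(url, workers=4, requests_per_worker=25, fetch=http_status):
    """Send requests from `workers` threads and count the status codes received."""

    def worker(worker_id):
        return [fetch(url) for _ in range(requests_per_worker)]

    counts = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for statuses in pool.map(worker, range(workers)):
            counts.update(statuses)
    return counts


def fake_fetch(url):
    """Stand-in for a throttled API that always answers 502 (no network involved)."""
    return 502


# Dry run against a placeholder URL; swap in your API endpoint and drop `fetch` for real load.
dry_run = run_load(
    "https://example.invalid/shops", workers=4, requests_per_worker=3, fetch=fake_fetch
)
print(dry_run)
```

Four workers mirror the four script instances above. Once the table's read capacity drops to 1, a real run should start showing 502s in the counter.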

  • DevOps Guru monitors and reports this anomaly with the cause of the anomaly and applicable recommendations.

  • We can also confirm the relevant event which caused this issue.

event history

dynamodb config recommendation

graph anomaly

From DevOps Perspective

  • DevOps Guru removes the need for manual monitoring of applications and infrastructure on AWS.

  • With pre-trained ML models, it covers an essential requirement of a production environment: it generates insights and recommendations for application and infrastructure issues as they happen, as well as for future operational issues that we cannot anticipate right now.

I can definitely say it will help you achieve the high availability and operational excellence for AWS infrastructure that you promise to your customers.

Till then, Happy Learning!

Would love to know your opinions on this, let's connect! (@imjatinmehrotra, LinkedIn)
