Build a distributed log analysis system using ELK

In a distributed system, logs are scattered across many servers, which makes monitoring and troubleshooting difficult. We built a complete log collection, analysis, and display pipeline on top of ELK (Elasticsearch, Logstash, Kibana).

Architecture diagram


Main idea

1. Organize Rails logs

What we care about most are the Rails access logs, but the format of the raw Rails logs is problematic. For example:

Started GET "/" for 10.1.1.11 at 2017-07-19 17:21:43 +0800
Cannot render console from 10.1.1.11! Allowed networks: 127.0.0.1, ::1, 127.0.0.0/127.255.255.255
Processing by Rails::WelcomeController#index as HTML
  Rendering /home/vagrant/.rvm/gems/ruby-2.4.0@community-2.4/gems/railties-5.1.2/lib/rails/templates/rails/welcome/index.html.erb
  Rendered /home/vagrant/.rvm/gems/ruby-2.4.0@community-2.4/gems/railties-5.1.2/lib/rails/templates/rails/welcome/index.html.erb (2.5ms)
Completed 200 OK in 184ms (Views: 10.9ms)

As you can see, the log lines for a single request are spread across multiple lines, and under concurrency the lines of different requests get interleaved. To solve this problem, we use logstasher to generate a new log in JSON format:

{" identifier ":"/home/vagrant /. RVM/gems/ruby-2.4.0@community-2.4 / gems/railties - 5.1.2 / lib/rails/templates/rails/welcome/in Dex. HTML. Erb, "" layout", null, "name" : "render_template. Action_view," "transaction_id 35 c707dd9d4cd1a79f37" : ""," duration ": 2.34 , "request_id" : "bc291df8-8681-47 d3-8 e10 - bd5d93a021a0", "source" : "unknown", "tags" : [], "@ timestamp" : "the 2017-07-19 T09:29:05. 969 z ","@version":"1"} {" method ":" GET ", "path" : "/", "format" : "HTML", "controller" : "rails/welcome", "action" : "index", "status" : 200, "duration" : 146.71, "The view" : 5.5, "IP" : "10.1.1.11", "route" : "rails/welcome# index", "request_id" : "bc291df8-8681-47 d3-8 e10 - bd5d93a021a0", "source" :" Unknown ", "tags" : [] "request", "@ timestamp" : "the 2017-07-19 T09:29:05. 970 z", "@ version" : "1"}

2. Use Logstash to collect logs

Logstash uses a configuration file to describe where the data comes from, how it is processed, and where it is sent, corresponding to the three concepts of input, filter, and output.

Let's start with a simple configuration to verify that everything works:

input {
  file {
    path => "/home/vagrant/blog/log/logstash_development.log"
    start_position => beginning
    ignore_older => 0
  }
}

output {
  stdout {}
}

In this configuration, we read the log file generated in the previous step and print every event to stdout.
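Assuming the configuration is saved as, say, logstash_rails.conf (a hypothetical file name), Logstash can be run against it like this, producing output similar to the line below:

bin/logstash -f logstash_rails.conf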

2017-07-19T09:59:01.520Z precise64 {"method":"GET","path":"/","format":"html","controller":"rails/welcome","action":"index","status":200,"duration":4.85,"view":3.28,"ip":"10.1.1.11","route":"rails/welcome#index","request_id":"27b8e5a5-dd1d-4957-9d50-888c91435347","source":"unknown","tags":["request"],"@timestamp":"2017-07-19T09:59:01.030Z","@version":"1"}

Next, modify the Logstash configuration file so that the output goes to Elasticsearch instead:

input {
  file {
    path => "/vagrant/blog/log/logstash_development.log"
    start_position => beginning
    ignore_older => 0
  }
}

output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    user => 'xxx'
    password => 'xxx'
  }
}

As you can see, the configuration file is very readable: the input section points at the structured log file we prepared, and the output section sends the events to Elasticsearch.
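Note that there is no filter stage here. Because logstasher already writes one JSON document per line, a common addition (a sketch, not part of the original setup) is a json filter so that each field gets indexed as its own field in Elasticsearch:

filter {
  json {
    # Parse the JSON log line that Logstash stores in the "message" field
    source => "message"
  }
}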

Then you can use Kibana for log analysis.
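Before digging into Kibana, a quick way to confirm that events are actually arriving is to query Elasticsearch directly; this assumes the default logstash-* index pattern created by the elasticsearch output and the same credentials as above:

curl -u xxx:xxx 'localhost:9200/logstash-*/_count?pretty'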

3. Some Kibana practices

Based on Kibana, we can run custom Elasticsearch queries to pull out some very valuable data, for example (see the query sketches after this list):

  • Query the request status of a given endpoint
  • Find slow endpoints that take longer than 500 ms
  • Find endpoints that return 500 errors in production
  • Collect statistics on high-frequency endpoints, and so on
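In the Kibana search bar these mostly translate into simple Lucene-style queries over the fields that logstasher produces. The field names (status, duration, path) come from the JSON log shown earlier; the concrete values are only hypothetical examples:

path:"/" AND status:*     (request status of one endpoint)
duration:>500             (requests slower than 500 ms)
status:500                (requests that failed with a 500)

High-frequency endpoint statistics are usually built as a Kibana visualization with a terms aggregation on the path or route field rather than as a search-bar query.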

4. Future

With the data provided by ELK, we can now conveniently troubleshoot errors and gather statistics on high-frequency endpoints across the distributed system, which gives us guidance for the next round of optimization. We no longer have to guess at the hot 20% of endpoints from business logic alone; we have real data to back it up.

5. Problems

Of course, we also ran into some problems along the way. At one point Elasticsearch consumed so much memory that it brought down two machines.
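A common mitigation (our assumption, not something described above) is to cap the Elasticsearch JVM heap explicitly instead of letting it grow, for example in config/jvm.options:

# config/jvm.options (location varies by install method)
# Pin minimum and maximum heap to the same fixed size
-Xms2g
-Xmx2g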