Recently I’ve been tasked with setting up VxRail monitoring for a Dell EMC VxRail Appliance. Even though I succeeded in setting this up, it has not been smooth sailing. Therefore, I’ve written this blog post to help you set up monitoring for VxRail.
Set up VxRail monitoring (EMC does not want you to)
Since VxRail is initially set up by the vendor, I had to get acquainted with the product first. The first thing you should probably do after taking a product into management is set up monitoring. Hence, VxRail monitoring was the first bullet on my task list. After consulting with support, we were told that no external monitoring is supported. The system does not support SNMP, nor can it send out e-mail alerts when a critical event is found. The conclusion of our support conversation was that the only option is a dial-home to Dell/EMC using EMC Secure Remote Services (ESRS).
Whenever critical events occur, we don’t want to wait for EMC to contact us; we want to be aware right away. Hence I set up VxRail monitoring without official vendor support, which meant I had to build it myself.
Another drawback of VxRail is that it does not integrate with an existing ESRS cluster. Because we already had one in place for other EMC products, we were not very keen on adding more administrative load. Also, the ESRS that ships with VxRail doesn’t support a proxy. As a result, you will not be able to set up monitoring if you lack direct internet access. Update: after we provided feedback, EMC fixed this in VxRail Appliance software 4.0.300. In this update EMC added support for external ESRS. Consequently, we are now able to point multiple appliances to a single ESRS cluster, hooray!
Monitoring over vCenter
VxRail monitoring of the integrated vSAN and host hardware sensors is available through the vCenter SDK (just like any other hardware running vSphere). Unfortunately, this does not cover any of the VxRail appliance’s internal events. Therefore, this is not a fully satisfying solution. As most monitoring solutions support vSphere monitoring out of the box, I will not discuss this in detail, but I do recommend setting up monitoring of the Host Health Status.
What if I want to set up VxRail monitoring myself?
As there is no direct support from EMC for setting this up, we decided to take matters into our own hands. The VxRail Appliance (VXRA) ships with an internal PostgreSQL database that holds the health events.
Let’s start exploring the database
    # list databases
    psql -U postgres -l

    # list tables
    psql -U postgres -d mysticmanager -c "\dt"

    # list columns
    psql -U postgres \
      -d mysticmanager \
      -c "select * from event_code where false"

    psql -U postgres \
      -d mysticmanager \
      -c "select * from mystic_event where false"

    # find criticalities
    psql -U postgres \
      -d mysticmanager \
      -c "SELECT DISTINCT severity from event_code"

    severity
    -----------
    3Info
    2Warning
    1Error
    0Critical

Listing the tables shows two of potential interest: mystic_event and event_code are the tables we need. We are not interested in events of severity ‘3Info’.
We can put this all together as follows:
    SELECT count(*)
    FROM event_code AS ec
    INNER JOIN mystic_event AS me ON ec.code = me.code
    WHERE ec.severity ~ '[0-2].*'
      AND me.unread = 't';
This SQL query uses a regex to select the events we are interested in. It returns the number of unread events with severity 0Critical, 1Error, or 2Warning.
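To see which severity labels the regex in the query selects, here is a small sketch that applies the same pattern to the four labels found in event_code. It uses `grep -E` as a stand-in for PostgreSQL’s `~` operator (for a pattern this simple the two should behave alike), and `severity_is_alertable` is a hypothetical helper name, not something that ships with the appliance.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a severity label matches the pattern
# used in the query. grep -E stands in for PostgreSQL's ~ operator here.
severity_is_alertable() {
    printf '%s\n' "$1" | grep -Eq '[0-2].*'
}

# Severity labels as returned by the DISTINCT query on event_code.
for label in 0Critical 1Error 2Warning 3Info; do
    if severity_is_alertable "$label"; then
        echo "$label: counted"
    else
        echo "$label: ignored"
    fi
done
```

Note that the pattern is not anchored, so it matches any label that contains a 0, 1 or 2 anywhere. With the four labels above that makes no difference, but `'^[0-2]'` would be the stricter choice if other labels ever show up.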
Integrating this in your monitoring solution
Most monitoring tools support running a SQL query against the most popular databases. For example, in Nagios you can use check_sql. However, this brings along some other requirements as well. First of all, PostgreSQL only listens on the loopback interface by default.
Change the postgresql.conf file like so:
    listen_addresses='*'
    # Redirect output to pg_log directory
    logging_collector = on

    # log timestamp, pid, session log line#, user, db
    log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
    log_min_messages=warning
    log_min_error_statement=warning

    # log SQL which takes longer than 3000ms.
    log_min_duration_statement=3000
    log_lock_waits = on

    # log temp files with size >= 1024kb
    log_temp_files=1024

    # name so it can be rotated using Linux logrotate
    log_filename='postgresql%Z.log'

    # disable log rotation as it is handled by logrotate
    log_rotation_age='0'
    log_rotation_size='0'

    # Do not report debug and notice level messages.
    client_min_messages=warning

    max_connections = 100

    effective_cache_size = 128MB
    shared_buffers = 8MB
    work_mem = 1MB
    maintenance_work_mem = 16MB
    wal_buffers = 64kB
After that, we need to set up client authentication. The example below ‘trusts’ the monitoring system to log in as any user, without supplying a password. Obviously, don’t do this in a production environment. Set up a monitoring account with read-only access instead.
Change the pg_hba.conf file like so:
    #/var/lib/pgsql/data/pg_hba.conf

    # Trust connections over the Unix domain socket
    local all all trust
    host all all 127.0.0.1/32 trust
    host all all <your monitoring server IP>/32 trust
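For anything beyond a lab, a dedicated read-only account is the safer route. The following is a minimal sketch, not a vendor-supported procedure. It assumes a hypothetical role name (`monitoring`), a placeholder password, and that both tables live in a schema the new role can already use (the default `public` schema). The function only prints the SQL, so you can review it before piping it into psql.

```shell
#!/bin/sh
# Print the SQL for a read-only monitoring role.
# $1 = role name (hypothetical, e.g. "monitoring"), $2 = password.
monitoring_role_sql() {
    cat <<EOF
CREATE ROLE $1 LOGIN PASSWORD '$2';
GRANT CONNECT ON DATABASE mysticmanager TO $1;
GRANT SELECT ON event_code, mystic_event TO $1;
EOF
}

# Review the statements before applying them:
monitoring_role_sql monitoring 'change-me'
# To apply on the appliance (assumption: psql is run as the postgres user):
#   monitoring_role_sql monitoring 'change-me' | psql -U postgres -d mysticmanager
```

With such a role in place, the last pg_hba.conf line could be narrowed to something like `host mysticmanager monitoring <your monitoring server IP>/32 md5`, so that only this account can connect remotely and it has to supply its password.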
Next, restart the services (or the full appliance) and you are in business!
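If your monitoring tool can run plugin scripts, the query can be wrapped in a Nagios-style check. This is a rough sketch under a few assumptions: the function names are hypothetical, psql is installed on the monitoring server, and the database user is allowed to connect remotely. It follows the usual Nagios plugin exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```shell
#!/bin/sh
# Sketch of a Nagios-style check around the unread-events query.
# Function names and the default user below are hypothetical.

QUERY="SELECT count(*) FROM event_code AS ec
INNER JOIN mystic_event AS me ON ec.code = me.code
WHERE ec.severity ~ '[0-2].*' AND me.unread = 't';"

# Map an event count to a Nagios status line and exit code
# (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).
status_from_count() {
    case "$1" in
        ''|*[!0-9]*) echo "UNKNOWN - could not read event count"; return 3 ;;
        0)           echo "OK - no unread VxRail events"; return 0 ;;
        *)           echo "CRITICAL - $1 unread VxRail event(s)"; return 2 ;;
    esac
}

# Run the query against the appliance and report the result.
# $1 = appliance host, $2 = database user (defaults to postgres).
check_vxrail_events() {
    count=$(psql -h "$1" -U "${2:-postgres}" -d mysticmanager -At -c "$QUERY" 2>/dev/null)
    status_from_count "$count"
}
```

A plugin wrapper would then call something like `check_vxrail_events <appliance host> monitoring` and exit with that return code. The sketch reports any unread warning, error or critical event as CRITICAL. Splitting warnings out into their own state would need a second query per severity.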
When you are in trouble, you should engage EMC support immediately. EMC does not allow customers to self-help when there are issues. However, that hasn’t stopped us from trying ourselves; after all, no one likes waiting for support when a serious issue is bringing down production. Here is a quick overview of useful commands:
|Create database dump||
You can find the most recently updated log files like this:
    find / -name '*.log' -printf "%T+\t%p\n" | sort
Overall, these are the most important logs to watch:
Quick summary on my experience
To sum up my experience:
- Every appliance needs its own ESRS.
  - Does not integrate with existing ESRS for VPLEX etc.
  - Fixed in 4.0.300.
- No third-party monitoring supported.
- Does not easily integrate with existing vSphere environments.
  - Using VUM is not supported.
  - Integrating with an existing DvSwitch is not supported.
  - No live migrations from another cluster when using DvSwitch.
  - Renaming the appliance VM is not supported.
    - Fixed in 4.0.300.
- Behind on software updates (ESXi, vCenter, vSAN).
  - vCenter 6.5 not supported.
    - Fixed for external vCenter in 4.0.310 (Sept 2017).
    - Native support in 4.5.
- No CLI.
- The Dell EMC Engineering team does all VxRail analysis. For any issue you face on VxRail, you need to contact the Dell EMC Engineering team for help.
If you’ve got any questions regarding this, feel free to reach out to me on Twitter.