2017-03-05



alvinashcraft shared this story from IndexOutOfRange.

While working on the next angle on how to speed up calculating similarities, I started investigating how to get better telemetry from cookit. Getting telemetry is easy - making sense of it is the hard part. This also exposed another pain point of the current setup - logging and monitoring. Since cookit is my pet, nonprofit project, it was time to do something.

There is a comparison table at the end, along with what I’ve chosen.

The current setup is based on:

NLog - used for almost all logging in the website and in the Windows Service (You can read more about cookit architecture here). Most errors are logged into files, the critical ones are emailed to me.

Hangfire dashboard - for checking job status and if it ended with a failure. If it crashed, only the main exception is displayed.

Remote Desktop - for troubleshooting, performance counters and all monitoring stuff (trust me I know how badly RDP and monitoring placed in one sentence sounds :( ).

Except for the last one, the setup was more or less OK on a daily basis. But in more advanced scenarios it started falling short more and more:

email notifications are fine for being informed that something is wrong, but for the details I have to RDP in and look at the files, which is tolerable only on a PC. Doing this on a tablet or a phone is a waste of time. Another problem is that if something goes wrong, I get thousands of emails.

no performance metrics except for the Hangfire dashboard.

since I am running cookit on crappy hardware, RDP is slow.

any analytics or finding correlations had to be done manually. Doing it this way is a pain and a waste of my time.

no way to do a post mortem on why the site was slow an hour ago. The only source is the NLog files.

So what I wanted to achieve (in order of importance):

Visualizations. - Yes, this is the first place. Looking at tables is the worst way to get a glimpse of what is happening. A simple chart makes a huge difference.

Accessible everywhere. - This means no need to RDP or use any app that has to be installed. It has to be web based.

Not running on my machine. - Every system needs some looking after: checking the logs once in a while, upgrading to a newer version and so on. This requires time and knowledge (knowledge can be translated into the time needed to acquire it). At the moment this is not the area where I want to invest my time, so SaaS it is.

Performance counters of the machine. - Knowing the stats of the application without knowing the stats of the whole machine is close to pointless when things go haywire.

Centralized logging. - No more digging through multiple NLog files. It has to be in one place.

Custom events. - I want to add custom timings for each TPL Dataflow block that I am using in data processing workflows. Also seeing aggregated timings of Hangfire jobs would be nice and would show me where to invest time in optimizations.

Real-time. - When I get a report from Pingdom that the site is not responding, I want to check what is happening right now on the machine. Hour-old data, in this case, is of no use.

Alerting. - Having monitoring without alerting works only when you can have huge monitors displaying the dashboard in a room filled with people dedicated to monitoring one app. Even then it only works in movies. Alerting is a must have and a time saver.

Historic data. - The time between getting an alert and having time to look at it is almost always non-zero in my case. Having at least 8 hours (remember, sleep is a good thing?) of data is a must have.
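The custom events item above asks for timings per processing step and aggregated timings per job. A minimal sketch of how such timings could be collected and summarized before being sent to whichever tool is chosen. Everything here is hypothetical: `StageTimings`, the stage name and the injectable clock are illustration only and are not part of cookit, TPL Dataflow or Hangfire.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class StageTimings:
    """Collects durations (in seconds) per named stage. Hypothetical helper."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the collector can be tested deterministically.
        self._clock = clock
        self._samples = defaultdict(list)

    @contextmanager
    def measure(self, stage):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the measured code raised.
            self._samples[stage].append(self._clock() - start)

    def summary(self):
        """Return count, mean and max duration for every stage seen so far."""
        return {
            stage: {"count": len(values), "mean": mean(values), "max": max(values)}
            for stage, values in self._samples.items()
        }


if __name__ == "__main__":
    timings = StageTimings()
    with timings.measure("parse_recipe"):  # hypothetical stage name
        sum(range(100_000))
    print(timings.summary())
```

The summary dictionary is the part that would be forwarded as a custom event; how it is shipped depends entirely on the monitoring tool.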



Google Analytics

Google Analytics is way more than an SEO/SEM tool. With its support for custom client and server events, it can be turned into a monitoring tool. Another plus is that I already have it and look at it from time to time.
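To make the idea of a server-side custom event concrete, here is a rough sketch that only builds the request body for an event hit. It assumes version 1 of the Measurement Protocol from the Universal Analytics era, so verify it against current Google documentation before relying on it. The tracking ID, client ID and event names are placeholders, and `build_event_payload` is a hypothetical helper, not part of any Google library.

```python
from urllib.parse import urlencode

# Collection endpoint of the Universal Analytics Measurement Protocol (v1).
# Assumption: this is the protocol version in use; nothing is sent here.
GA_COLLECT_URL = "https://www.google-analytics.com/collect"


def build_event_payload(tracking_id, client_id, category, action,
                        label=None, value=None):
    """Build the form-encoded body for a single 'event' hit."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID (placeholder in the example below)
        "cid": client_id,    # anonymous client ID
        "t": "event",        # hit type
        "ec": category,      # event category
        "ea": action,        # event action
    }
    if label is not None:
        params["el"] = label            # optional event label
    if value is not None:
        params["ev"] = str(int(value))  # optional event value, sent as an integer
    return urlencode(params)


if __name__ == "__main__":
    body = build_event_payload("UA-XXXXX-Y", "555", "jobs", "similarity_done", value=42)
    print(body)  # this body would be POSTed to GA_COLLECT_URL
```

Keeping payload construction separate from sending makes it easy to log or test what would be reported without touching the network.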

The good

good/decent visualizations

very good UI (fast and intuitive)

accessible everywhere

free (at my scale)

I already have it

has custom events

monitors browser performance

supports custom client and server events

has custom dashboards

good Android application

The bad

working with custom events is not that great

no collection of system performance counters (can be easily written, but still a minus)

no log aggregation

no alerting

ELK Stack + Graphite/Grafana

The ELK stack with Graphite and Grafana is the market standard for monitoring and central logging. This means that it is very easy to find Docker images, help and almost everything else that is needed. Viewing on mobile devices is possible, but far from great, and this won’t change anytime soon judging from this GitHub issue.

Grafana panel:

Kibana panel:

The good

the market standard

great visualizations (both Grafana and Kibana)

great data crunching tools

there are some ELK SaaS providers.

great UI and visualization

web based

there are existing applications for server stats monitoring

The bad

mostly Linux based and since I am running on a Windows Server this meant getting another Linux machine. Setting every

most providers are not free

a lot of stuff to manage on my own

no out of the box monitoring

New Relic

New Relic is a powerhouse when it comes to features. It has almost everything, from APM (Application Performance Management) to log aggregation. It is a very interesting product because it is presented in a way that is readable to nontechnical people. It automatically sets up notifications for Apdex score violations, and the UI is oriented toward showing the main performance KPIs (like Google Analytics), not tech details as in, for example, Azure Application Insights (next).
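For readers who have not met Apdex before: it condenses response times into a single score between 0 and 1. Requests at or below a threshold T count as satisfied, requests between T and 4T count as tolerating and are weighted by one half, and anything slower counts as frustrated. A small sketch of that standard formula follows; the function name, the sample data and the 0.5 second threshold are assumptions for illustration and say nothing about how New Relic computes or configures it internally.

```python
def apdex(response_times, threshold):
    """Apdex score for response times (seconds) and a threshold T (seconds)."""
    if not response_times:
        raise ValueError("at least one response time is required")
    # Satisfied: t <= T
    satisfied = sum(1 for t in response_times if t <= threshold)
    # Tolerating: T < t <= 4T, counted at half weight
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Frustrated requests (t > 4T) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(response_times)


if __name__ == "__main__":
    samples = [0.2, 0.4, 0.6, 1.5, 3.0]    # made-up response times
    print(apdex(samples, threshold=0.5))   # 2 satisfied, 2 tolerating, 1 frustrated
```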

The main dashboard:

I was quite impressed with how good New Relic inspects what went into each request. This is the Request monitoring page:

Another nice and fairly unique feature is the Geo View. It shows how page speed differs between geographic locations. And as you can see, I have a problem with Russia (I have no idea why):

Installation is the easiest of all the tools reviewed. All that is needed is to install the New Relic agent on the server. Dependency detection, metering, even browser performance tracking work out of the box.

One last thing to mention, and it may be subjective: I had the impression that the server slowed down a bit after installing the New Relic agent. It could be due to the fact that Application Insights was also running on it, or that my machine is just slow as it is. If someone else also has this impression, ping me.

To sum up - New Relic is a powerful beast with the looks that a manager can understand :)

The good

out of the box monitoring of .NET, Ruby, Node.js, PHP, Java, Python and Go applications

Supports web and non-web applications

Supports browser monitoring

fast and very good UI

Browser profiling works without the need to add a script (it is added automatically by the installed New Relic agent)

very easy to install

The bad

only paid version (I get why they have only a paid version. This is a minus for me when choosing for this use case. Remember this is a personal review).

it consists of multiple products, and even when in trial mode, enabling some of them requires contacting the sales department.

every sub-product is separately paid.

registration (demo) requires a phone number, company name, company size, and role.

uninstalling requires a machine reboot.

expensive

confusing pricing

Retrace (Stackify)

Retrace is a product from the same guys (and girls) that developed Prefix - a good local profiler. I must say I was impressed by the features and polish that this product has. The UI may not show it fully, but it is not that far away from the big players. In some features, it’s even way ahead.

Retrace panel:

Retrace APM+ panel (from a demo application they provide. I was not able to get it working on my project):

The good

registration initially requires only name, surname and email
