HealOps, the healing monitoring framework. Part 1 - Why HealOps?

Intro and info.

The diagram above is a context diagram of HealOps. It shows the components that are part of HealOps, what they deliver to the system, and whether they input or extract data in their role in the system. It also identifies the actors of the system.

HealOps is a monitoring and healing framework for IT systems and their components, developed primarily in PowerShell. I designed and developed it to get fewer on-duty calls, to save time, and to reach a system uptime that is as near 100% as possible. The thought is that a system built on the TDD (Test Driven Development) process gives you a clear and formalized set of rules for testing the state of an IT system and its components. This then makes it possible to act on and repair an identified broken state. Both the identification and the repair of a broken state happen automatically. Throughout this cycle of testing and potentially repairing state, state data is automatically reported to a time series database. State data can then be pulled into a monitoring and visualization system like Grafana. From Grafana, and other systems like it, it is possible to integrate with an incident management system. With this setup, a broken state of an IT system that could not be fixed is reported to Grafana, which in turn sends a request to an incident management system. The incident management system can then be configured to automatically trigger an alert to the on-call personnel, thereby getting some human eyeballs on the issue.

A visualization of the state monitoring and repairing described above will likely help in understanding HealOps. A sequence diagram comes in handy.

The above image shows the main functions involved when testing and repairing the state of an IT system component.
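To make the cycle concrete, here is a minimal, language-neutral sketch of it, written in Python for brevity. It is not HealOps's actual PowerShell code. The names `Component`, `check_and_heal` and the metric name are hypothetical, and the shape of the reported data point is only an assumption about what state data could look like.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Component:
    """Hypothetical stand-in for one tested component of an IT system."""
    name: str
    test: Callable[[], bool]    # returns True when the state is healthy
    repair: Callable[[], None]  # tries to bring the component back to a healthy state


def check_and_heal(component: Component, report: Callable[[Dict], None]) -> bool:
    """Test a component, attempt one repair if it is broken, then report its state."""
    healthy = component.test()
    repaired = False
    if not healthy:
        component.repair()
        healthy = component.test()  # re-test to verify that the repair worked
        repaired = healthy
    # Assumed data point shape; a real backend defines its own format.
    report({
        "metric": f"healops.{component.name}.state",
        "timestamp": int(time.time()),
        "value": 1 if healthy else 0,
        "tags": {"repaired": str(repaired).lower()},
    })
    return healthy


# Usage: a fake service that starts out broken and is fixed by its repair.
service = {"running": False}
reported = []
web = Component(
    name="webserver",
    test=lambda: service["running"],
    repair=lambda: service.update(running=True),
)
ok = check_and_heal(web, reported.append)
```

If the re-test still fails, the reported value stays at 0. That is the case a dashboard alert rule would pick up and forward to the incident management system.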

For everything related to HealOps, look no further:

  • Code > here
  • News > Right here where you are. My Bengtssondd blog.
  • The occasional update > Twitter

Spill the beans. What are the benefits of using HealOps?

  • You save time by reducing the number of times where you have to “pull the chestnuts out of the fire”.
  • You save energy and motivation for fixing and improving, instead of just maintaining and keeping your head above water.
  • Monitoring and healing via HealOps happens automatically. A scheduler is used as a facilitator for running jobs that monitor the state of one or more components of an IT system. This reduces the reaction time from identifying a broken state of a service until a repair of that service kicks in.
  • What is being done and what was done is clear to everyone.
    • How and what is tested is clearly formalized in code in a test file.
    • State data is reported to a monitoring and visualization backend system.
    • Reports can be generated on state data, as well as on the cases where an alarm was triggered for human intervention.
    • HealOps logs what it does along the way.
  • When you are on call and get an alarm, the info that comes with it is highly detailed, as you can send along a screenshot of what triggered the alarm in Grafana. Also, most incident management systems let you specify information to send with a triggered alarm. This could be a wiki article, tips and tricks, what often resolves a given case and so forth.
  • HealOps could save you money on personnel who are employed only to glance at screens in order to react in case of a state going bonkers, as HealOps automatically identifies a broken state, very likely quicker than a human, and then automatically tries to repair it.
    • The money saved could then be used on personnel who improve the product you sell, the end-customer experience, company facilities and so on.
  • It is open-source and free.
    • You can review the code.
    • You can help by committing new features and fixing bugs, and thereby speed up the delivery of the features you would like to have included or the bugs that need to be fixed.

Any downsides?

Fair question. Here are the ones I’ve been able to identify so far.

  • Only one time series database backend is currently supported, namely OpenTSDB.
    • The plan is to support several more over time.
  • There are only a few HealOps packages currently available. A HealOps package is the entity that holds Tests and Repairs files. A Tests file is invoked in order to verify the state of a component of an IT system. If its state is broken, a repair is attempted via the Repairs file. As you can understand, these packages are both an integral part of HealOps and very useful. So you will have to do a little bit of work here in order to use HealOps with the IT systems you have. Luckily, I’ve designed the way to develop a HealOps package so that it can be standardized and easy to follow. So with a little bit of practice, and by reading the (upcoming) documentation, you should be able to do it like a boss.
  • Running HealOps on Linux and macOS is not officially supported right now, but it is the number one priority. The plan is to support it within the next 2-4 months.
  • As HealOps is developed in PowerShell, it is not extremely performant. However, I’m considering refactoring parts of HealOps, meaning that HealOps might become a binary PowerShell module in the future.
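On the OpenTSDB backend mentioned in the list above: OpenTSDB accepts data points over its HTTP API at `/api/put`, where each point has a metric name, a timestamp, a value and at least one tag. The sketch below, in Python rather than PowerShell, only builds such a request without sending it. The helper name `build_put_request`, the metric name, the tag and the server address are hypothetical, and this is not how HealOps itself reports state data.

```python
import json
import time
import urllib.request


def build_put_request(base_url, metric, value, tags, timestamp=None):
    """Build (but do not send) an OpenTSDB /api/put request for one data point."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    point = {
        "metric": metric,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "value": value,
        "tags": tags,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/put",
        data=json.dumps(point).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical metric name, tag and server address (4242 is OpenTSDB's default port).
request = build_put_request(
    "http://localhost:4242",
    metric="healops.webserver.state",
    value=1,
    tags={"host": "web01"},
    timestamp=1577836800,
)
payload = json.loads(request.data)
# urllib.request.urlopen(request) would send the point to a running OpenTSDB instance.
```

A backend for another time series database would mainly need to swap out this payload format and endpoint.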

HealOps needs a back rub.

There are several features I would like to implement in HealOps. Number one on the list is job scheduling on Linux and macOS. But there is more: full testing of the HealOps code in Pester, documentation, considering support for container virtualization, creating prototypes for that... and the list goes on. But time is flying.

So, if you think HealOps is a cool system and the type of project you could see yourself spending time on, be my guest. I would highly appreciate it. If you have any questions or need more info before you commit yourself, feel free to hit me up on Twitter or LinkedIn (links for those services can be found in the lower left corner of this blog).

Coming up.

In part two on HealOps, I plan to describe the components of HealOps itself in great detail. The technical nerdy stuff. I hope you will read along when that comes out.

Until then you are more than welcome to have a look at the code. Find it here.

Adieu.

I hope you enjoyed this post and that you will check out HealOps. Thank you for reading along.

Over and out 💨


© 2020. All rights reserved.
