In part 1, I covered the basics of how Amazee Labs thinks about Maintenance and the way we approach automating the parts of the process that are repetitive. I also introduced three categories of problems, as well as the two kinds of maintenance engineering.
In part 2 of this two-part series, I’ll introduce the automation of problem detection and resolution. I’ll also introduce Lagoon Insights and the Amazee Labs Maintenance Tooling. Finally, I’ll explain how this all comes together in our service offering.
Automation of problem detection and resolution
When it comes to corrective maintenance, automation almost always focuses on problem detection. If corrective maintenance has been engaged, it's a good indication that either there was no preemptive maintenance plan, the plan didn’t cover the particular problem, or the plan failed entirely.
Preemptive maintenance is a bit more nuanced, because it is about planning maintenance ahead of time to reduce the chaos of managing corrective maintenance interventions.
The following table breaks down how we think about the automatability of the problem categories we introduced earlier.
Performance category: Because these problems are almost always related to operational expectations not being met, the problem detection can be automated. SEO goals, accessibility targets, conversion rates, and technical performance are all covered by pretty great tools that are operated by experts in these fields. However, when it comes to fixing these problems, the fixes are unfortunately almost impossible to automate. This is where we bring in engineering expertise to resolve the problems.
Reliability category: As mentioned, reliability is the site's propensity to be available when someone tries to use it. Again, problem detection is highly automatable. When a site goes down due to platform or infrastructure problems, the hosting provider will often alert you automatically. In most cases, with a hosting provider of reasonable quality, the problem will be resolved without the maintenance team's involvement.
Resolutions for reliability problems relating to application configuration or application code can often be automated as well. For instance, one of our policies can remove the administrator role from Drupal users who might have been granted it by mistake.
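As a rough illustration of the administrator-role example, a policy can be thought of as a detection step paired with an optional remediation step. The sketch below is plain Python and not the format of any real policy engine; the `User` type, the `approved_admins` allowlist, and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical, simplified stand-in for a Drupal user account."""
    name: str
    roles: set[str] = field(default_factory=set)


def audit_admin_roles(users: list[User], approved_admins: set[str]) -> list[User]:
    """Detection: return users who hold 'administrator' but are not approved."""
    return [
        user for user in users
        if "administrator" in user.roles and user.name not in approved_admins
    ]


def remediate_admin_roles(offenders: list[User]) -> list[str]:
    """Resolution: strip the role and report which accounts were changed."""
    changed = []
    for user in offenders:
        user.roles.discard("administrator")
        changed.append(user.name)
    return changed


# Invented example accounts; only "alice" is approved as an administrator.
users = [
    User("alice", {"administrator"}),
    User("bob", {"editor", "administrator"}),
    User("carol", {"editor"}),
]
offenders = audit_admin_roles(users, approved_admins={"alice"})
changed = remediate_admin_roles(offenders)
print(changed)
```

Keeping detection and remediation as separate functions means the audit can run everywhere, while the remediation is only switched on for sites where an automatic fix is acceptable.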
Security category: Security problem detection is also highly automatable. This, of course, does not negate the effectiveness of or need for penetration testing or security audits, but a lot of potential problems can be detected automatically. Additionally, the application of updates and patches in the security category is also highly automatable in the right environment. The caveat here is that, to achieve this at a reasonable scale, the sites themselves need to meet certain configuration, architecture, and hosting requirements. This category really brings the “cattle not pets” analogy to bear.
Amazee Labs Tooling and Lagoon Insights
Large parts of the maintenance automation that we have built rest on the Lagoon Insights components of the amazee.io open-source container-based hosting platform. We co-developed a lot of the features in Lagoon Insights along with amazee.io, and contributed these features to the Lagoon Open Source project.
Although sites hosted on Lagoon benefit from tighter integration with the insights systems, the Amazee Labs tooling has been developed to be interoperable with hosting providers such as Acquia Cloud, Platform.sh, and Pantheon, as well as legacy infrastructure where Drush can execute commands.
The problems insights component provides a robust database and API for storing and querying detected problems in an appropriate data structure.
While most of this article has focused on problem detection and resolution, another aspect of maintenance engineering is being able to quickly query and collect fact-based information about the systems under maintenance. For example, while a problem might be phrased as “The webform module needs a security patch applied”, a fact would be “The webform module is on version 8.3.1”. Facts are objective details about the systems, whereas problems are by their nature the opinionated siblings of facts.
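To make the distinction concrete, here is a minimal sketch of how a fact and a problem might relate: a problem is what you get when a rule, which is an opinion, is applied to a fact. The `Fact` and `Problem` types, their fields, and the `evaluate` function are hypothetical and do not reflect the actual Lagoon Insights data structures. The "minimum safe version" threshold is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """An objective detail about a system, with no judgement attached."""
    source: str  # which site or environment the fact came from
    name: str    # e.g. "webform.version"
    value: str   # e.g. "8.3.1"


@dataclass(frozen=True)
class Problem:
    """An opinion about a fact: something we think should be fixed."""
    source: str
    identifier: str
    description: str


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '8.3.1' into (8, 3, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def evaluate(fact: Fact, minimum_safe: str) -> Optional[Problem]:
    """Apply a rule to a fact; return a Problem only if the rule is violated."""
    if parse_version(fact.value) < parse_version(minimum_safe):
        return Problem(
            source=fact.source,
            identifier=f"outdated:{fact.name}",
            description=(
                f"{fact.name} is {fact.value}; "
                f"expected at least {minimum_safe}"
            ),
        )
    return None


fact = Fact(source="example-site", name="webform.version", value="8.3.1")
# The threshold below is a made-up value used only to trigger the rule.
problem = evaluate(fact, minimum_safe="8.3.2")
print(problem.description if problem else "no problem detected")
```

The same fact can produce a problem under one rule and none under another, which is why facts are worth storing on their own.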
There are four systems that collect the information to populate the Problems and Facts Insight databases.
Drutiny: Drutiny is a generic Drupal site auditing and optional remediation tool.
Fun Fact: The name is a portmanteau of "Drupal" and "scrutiny."
Another fun fact: Due to its extensible architecture, we can extend Drutiny to scan more than Drupal sites!
Harbor: Harbor is an open-source trusted cloud native registry project that stores, signs, and scans content. Harbor extends the open-source Docker Distribution by adding the functionalities usually required by users such as security, identity and management. Additionally, Harbor has the ability to do vulnerability scanning. Harbor scans images regularly for vulnerabilities and has policy checks to prevent vulnerable images from being deployed.
Trivy: Trivy is a simple and comprehensive vulnerability scanner for containers and other artifacts, suitable for continuous integration environments. The Harbor Scanner Adapter for Trivy is a service that translates the Harbor scanning API into Trivy commands and allows Harbor to use Trivy for providing vulnerability reports on images stored in Harbor registry as part of its vulnerability scan feature.
Fact collectors: We have developed several approaches to automatically scanning a site for the facts that we are interested in. We collect everything from PHP versions in use through to module versions, Drupal core version, enabled user counts, etc.
Notifications: Finally, the notification engine takes the information stored in the problems database and radiates it to any of the notification types supported by Lagoon. We typically use Slack for this.
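Returning to the fact collectors described above: one possible way to gather version facts without touching a running site is to read the project's `composer.lock`, which lists every installed package with its version. This is an illustration only and not a description of how the Amazee Labs collectors work. The sample data, the `collect_version_facts` function, and the `<package>.version` naming scheme are all assumptions.

```python
import json

# Invented sample in the shape of a composer.lock file.
# The version numbers are placeholders, not real release data.
SAMPLE_LOCK = """
{
  "packages": [
    {"name": "drupal/core", "version": "9.9.9", "type": "drupal-core"},
    {"name": "drupal/webform", "version": "6.0.0", "type": "drupal-module"},
    {"name": "symfony/console", "version": "v4.4.0", "type": "library"}
  ]
}
"""


def collect_version_facts(lock: dict) -> dict[str, str]:
    """Map each locked package to a '<package>.version' fact."""
    facts = {}
    for package in lock.get("packages", []):
        name = package.get("name")
        version = package.get("version")
        if not name or version is None:
            continue  # skip malformed entries rather than guess
        facts[f"{name}.version"] = version
    return facts


facts = collect_version_facts(json.loads(SAMPLE_LOCK))
for name, value in sorted(facts.items()):
    print(f"{name} = {value}")
```

Facts such as the PHP version in use or enabled user counts cannot be read from a lock file and would need a collector that queries the running site.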
Putting it all together
Having covered both the way we think about maintenance automation and the tools involved in it, how does this all come together in our Managed Web Maintenance service?
When it comes to automation, we always ask: “Are we automating problem detection or problem resolution?” Of course, sometimes we’re automating other parts of our process, but mainly we focus on problems.
For performance problems, we integrate certain third-party tools to detect conversion issues, accessibility issues, page speed issues, etc. Detected problems are analysed by our maintenance team and dealt with on a regular basis.
For reliability and security problems, we have developed a number of our own Drutiny policies and facts scanners. These range from detecting whether certain modules are enabled on production sites through to automated checks for broken links or commonly misconfigured modules. While a fair percentage of the problems are dealt with manually, we’re now at the stage of automating problem remediation for security-related issues. This means automatically updating code bases, running automated tests to confirm the sites’ critical paths are still operational, and committing the changes to production.
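The update, test, and commit sequence can be sketched as a small function whose steps are passed in as callables. This is a simplified outline under stated assumptions, not our production tooling: the step functions are placeholders, and the rollback step and the returned status strings are additions made for the example.

```python
from typing import Callable


def remediate(
    update: Callable[[], bool],
    run_tests: Callable[[], bool],
    commit: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Apply updates, then commit only if the critical-path tests pass."""
    if not update():
        return "nothing-to-update"
    if run_tests():
        commit()
        return "committed"
    # Tests failed: undo the update and hand the site over to a human.
    rollback()
    return "needs-manual-review"


# Placeholder steps that only record what happened.
log: list[str] = []
status = remediate(
    update=lambda: log.append("update") or True,
    run_tests=lambda: log.append("test") or False,  # simulate a failing test
    commit=lambda: log.append("commit"),
    rollback=lambda: log.append("rollback"),
)
print(status, log)
```

Because the tests gate the commit, a failed run leaves the site unchanged and turns into a manual task instead of a broken deployment.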
We also rely heavily on Harbor and Trivy to provide us with an infrastructure-level view of the software underlying the Drupal instances we maintain. Detected problems are regularly analysed by our engineering team, and remediation plans are presented to customers for execution.
As we continue to develop and extend our maintenance service, our maintenance engineers both develop automated ways to detect and resolve problems and carry out the manual interventions when they're needed. This provides the maintenance engineers with critical information on what to automate next, and which features would be the most valuable for the largest number of sites in our portfolio.
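As a sketch of how a container scan could be turned into a list of problems for analysis, the snippet below filters a vulnerability report by severity. It assumes the report follows the JSON layout Trivy produces, with a top-level `Results` list whose entries may carry a `Vulnerabilities` list; the field names should be checked against the Trivy version in use. The sample report, its placeholder CVE identifiers, and the `extract_problems` function are invented for illustration.

```python
import json

# Invented sample data: the CVE IDs, package names, and versions are placeholders.
SAMPLE_REPORT = """
{
  "Results": [
    {
      "Target": "example-image (debian)",
      "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "PkgName": "examplelib",
         "InstalledVersion": "1.0.0", "FixedVersion": "1.0.1",
         "Severity": "HIGH"},
        {"VulnerabilityID": "CVE-0000-0002", "PkgName": "otherlib",
         "InstalledVersion": "2.3.0", "Severity": "LOW"}
      ]
    },
    {"Target": "composer.lock"}
  ]
}
"""

SEVERITY_ORDER = ["UNKNOWN", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def extract_problems(report: dict, min_severity: str = "HIGH") -> list[dict]:
    """Keep only vulnerabilities at or above the given severity."""
    threshold = SEVERITY_ORDER.index(min_severity)
    problems = []
    for result in report.get("Results") or []:
        # Targets without findings may omit the Vulnerabilities key entirely.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity", "UNKNOWN")
            if severity not in SEVERITY_ORDER:
                severity = "UNKNOWN"
            if SEVERITY_ORDER.index(severity) < threshold:
                continue
            problems.append({
                "target": result.get("Target", ""),
                "identifier": vuln.get("VulnerabilityID", ""),
                "package": vuln.get("PkgName", ""),
                "installed": vuln.get("InstalledVersion", ""),
                "fixed_in": vuln.get("FixedVersion") or None,
                "severity": severity,
            })
    return problems


problems = extract_problems(json.loads(SAMPLE_REPORT))
for p in problems:
    print(p["severity"], p["identifier"], p["package"], "->", p["fixed_in"])
```

Recording whether a fixed version exists is useful when drafting a remediation plan, since findings without a fix need a different response than those that only require an update.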
Did you know?
The Amazee Labs Managed Web Maintenance Scanning Software performs:
- Drupal core status, contrib module status, and best-practice Drupal configuration scanning
- Application dependency (Symfony, third-party libraries, etc.) security scanning
- Container security scanning for customers using containers
- Performance and uptime monitoring
- Automatic HTTPS Certificate validation
Our systems can be configured and augmented to perform custom checks specifically for your website context.
Do you have one or more Drupal sites that are missing a preemptive maintenance strategy?
Do you want to learn more about this maintenance service and how it can solve your maintenance pain?
Then reach out to us and let's chat!