- Created by Anton Kronseder, last modified by Robert Reiner on 01. Nov 2020
Provides detailed information about the architecture of the system.
- Name
- Software Architecture Documentation (Single Page)
Goals of this Documentation
This documentation is an example of arc42 documentation.
You may copy this documentation or parts of it for your own projects. In such cases you must include a link or reference to arc42 or aim42 (we regard this as fair-use).
Compared with real-world projects, the ratio of documentation to code in this example is oversized.
Disclaimer
We provide absolutely no guarantee, neither for the accuracy of this documentation nor for any property or feature of the software described here.
Do not use this software in critical situations or projects.
Overview
- Introduction and Goals
- Architecture Constraints
- System Scope and Context
- Solution Strategy
- Building Block View: Structural views on the system.
- Runtime View
- Deployment View: The three nodes (computers) shown in Deployment are connected via public internet.
- Concepts
- Design Decisions
- Quality Scenarios
- Technical Risks
- Glossary Items: List of central glossary items.
Introduction and Goals
Requirements Overview
Basic Usage
- A user configures the location (directory and filename) of an HTML file and the corresponding images directory.
- HtmlSC performs various checks on the HTML and reports its results either on the console or as an HTML report.
HtmlSC can run from the command line or as a Gradle plugin.
Terminology: What Can Go Wrong in HTML Files?
Apart from purely syntactical errors, many things can go wrong in HTML, especially with respect to hyperlinks, anchors and ids, as those are often maintained manually.
Broken cross references: Cross references (internal links) can be broken, e.g. due to a missing or misspelled link target.
Missing local resources: Referenced local resources (other than images) can be missing or misspelled.
Duplicate link targets: Link targets can occur several times with the same name, so the browser cannot know which is the desired target.
Illegal links: The links (aka anchors or URIs) can contain illegal characters or violate HTML link syntax. See IllegalLinkChecker.
Broken external links: External links can be broken for many reasons: misspelled, link target currently offline, illegal link syntax.
Missing alt attribute in image tags: Images missing an alt attribute.
Checking and reporting these errors and flaws is the central business requirement of HtmlSC.
Important terms (domain terms) of HTML sanity checking are documented in a (small) domain model.
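To make two of these flaw types concrete, the following Java sketch detects duplicate link targets and broken cross references in an HTML string. It is an illustration only: the class and method names are hypothetical, and it uses regular expressions on double-quoted attributes for brevity, whereas HtmlSC itself works on a parsed, DOM-like representation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only (hypothetical class): not how HtmlSC parses HTML. */
final class LinkFlawSketch {
    // Matches id="..." and href="#..." with double-quoted values only.
    private static final Pattern ID = Pattern.compile("\\bid\\s*=\\s*\"([^\"]*)\"");
    private static final Pattern LOCAL_HREF = Pattern.compile("\\bhref\\s*=\\s*\"#([^\"]*)\"");

    /** Returns every id value that is defined more than once (duplicate link targets). */
    static List<String> duplicateIds(String html) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        Matcher m = ID.matcher(html);
        while (m.find()) {
            String id = m.group(1);
            // Set.add returns false when the id was already present.
            if (!seen.add(id) && !duplicates.contains(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    /** Returns every local link target (href="#x") that has no matching id in the page. */
    static List<String> brokenCrossReferences(String html) {
        Set<String> ids = new HashSet<>();
        Matcher idMatcher = ID.matcher(html);
        while (idMatcher.find()) {
            ids.add(idMatcher.group(1));
        }
        List<String> broken = new ArrayList<>();
        Matcher hrefMatcher = LOCAL_HREF.matcher(html);
        while (hrefMatcher.find()) {
            String target = hrefMatcher.group(1);
            if (!ids.contains(target)) {
                broken.add(target);
            }
        }
        return broken;
    }
}
```

A real checker would also have to handle single-quoted attributes and the legacy name attribute on anchors, which is one reason to rely on a proper HTML parser.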
General Functionality
ID | Functionality | Description |
---|---|---|
G-1 | read HTML file | HtmlSC shall read a single (configurable) HTML file. |
G-2 | Gradle-plugin | HtmlSC can be run as Gradle-plugin. |
G-3 | command line usage | HtmlSC can be called from the command line with arguments and options. |
G-4 | configurable output | Output can be configured to console or file. |
G-5 | free and open source | All required dependencies shall be compliant with the CC-SA-4 licence. |
G-6 | available via public repositories | HtmlSC shall be available via public repositories, like bintray or jcenter. |
G-7 | | |
Types of Sanity Checks
ID | Check | Description |
---|---|---|
R-1 | missing image files | Check for all image tags whether the referenced image files exist. |
R-2 | broken internal links | Check all internal links from anchor-tags ( |
R-3 | missing local files | Either other HTML files, PDFs or similar. |
R-4 | duplicate link targets | Check all bookmark definitions (… |
R-5 | malformed links | Check all links for syntactical correctness. |
R-6 | missing alt-attribute | In image tags. |
R-7 | unused-images | Check for files in image-directories that are not referenced by any of the HTML files in this run. |
R-8 | illegal link targets | Checks for malformed or illegal anchors (link targets). |
ID | Check | Description |
---|---|---|
Opt-1 | missing external images | Check externally referenced images for availability. |
Opt-2 | broken external links | Check external links for both syntax and availability. |
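As a rough illustration of checks R-1 (missing image files) and R-7 (unused images), the following Java sketch compares referenced image names against the files in a directory. The class and method names are hypothetical, and the sketch assumes the image references have already been extracted from the HTML as plain relative file names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of HtmlSC. */
final class ImageFileCheckSketch {

    /** R-1: returns the referenced image paths that do not exist below baseDir. */
    static List<String> missingImages(Path baseDir, List<String> referenced) {
        List<String> missing = new ArrayList<>();
        for (String src : referenced) {
            if (!Files.isRegularFile(baseDir.resolve(src))) {
                missing.add(src);
            }
        }
        return missing;
    }

    /** R-7: returns the file names in imageDir that no reference points to, sorted by name. */
    static List<String> unusedImages(Path imageDir, List<String> referencedFileNames) throws IOException {
        // Files.list must be closed, hence try-with-resources.
        try (Stream<Path> files = Files.list(imageDir)) {
            return files.filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> !referencedFileNames.contains(name))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```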
Reporting and Output Requirements
ID | Requirement | Description |
---|---|---|
Rep-1 | various output formats | Checking output in plaintext and HTML. |
Rep-2 | output to stdout | HtmlSC can output results on stdout (the console). |
Rep-3 | configurable file output | HtmlSC can store results in files in configurable directories. |
Quality Goals
Stakeholders
Role | Description | Goal/Intention | Type |
---|---|---|---|
aim42 contributor | contributes to the aim42 method guide | wants to check generated HTML to ensure links and images are correct during the (Gradle-based) build process | role |
arc42 user | uses the arc42 template for architecture documentation | wants a small but practical example of how to apply arc42 | role |
Documentation author | writes documentation with HTML output | wants to check that the resulting document contains valid links and image references | role |
software developer | creates software and needs to provide documentation for it | wants an example of pragmatic architecture documentation and arc42 usage | role |
Architecture Constraints
HtmlSC shall be:
- developed under a liberal open-source license
- integrated with the Gradle build tool
- platform independent, running on the major operating systems (Windows™, Linux and Mac-OS™)
- runnable from the command line
System Scope and Context
Business Context
Context
Elements
Table 7. Business Context
Neighbour | Description |
---|---|
 | documents software with a toolchain that generates HTML. Wants to ensure that links within this HTML are valid. |
build system | |
 | HtmlSC reads and parses local HTML files and performs sanity checks within those. |
 | HtmlSC checks if linked images exist as (local) files. |
 | Optionally, HtmlSC can be configured to check for the existence of external web resources. Due to the nature of web systems, this check might need a significant amount of time and might yield invalid results due to network and latency issues. |
Deployment Context
Context
Elements
Doctype | Node / Artifact | Description |
---|---|---|
node | artifact repository | global public cloud repository for binary artifacts, similar to mavenCentral. HtmlSC binaries are uploaded to this server. |
artifact | build.gradle | Gradle build script configuring (among other things) the HtmlSC plugin to perform the Html checking. |
node | hsc user computer | where documentation is created, with HTML as the output format. |
node | hsc-development | where development of HtmlSC takes place |
artifact | hsc-plugin-binary | Compiled and packaged version of HtmlSC including required dependencies. |
For details, see the Deployment View.
Solution Strategy
Implement HtmlSC in Groovy and Java with minimal external dependencies. Wrap this implementation into a Gradle plugin, so it can be used within automated builds. Details are given in the Gradle plugin concept.
Apply the template-method-pattern (see e.g. Template-Method-Pattern) to enable:
- multiple checking algorithms. See the concept for checking algorithms.
- both HTML (file) and text (console) output. See the reporting concept.
Building Block View
Whiteboxes Level 1
Components of HTML Sanity Checker - Whitebox
Diagram
Blackboxes
Building Block | Description |
---|---|
CheckerCore | core: html parsing and sanity checking, file handling |
HSC Command Line Interface | (not documented) |
HSC Gradle Plugin | integrates the Gradle build tool with HtmlSC, enabling arbitrary gradle builds to use HtmlSC functionality. |
HSC Graphical Interface | (planned, not implemented) |
Reporter | outputs the collected checking results to configurable destinations, e.g. StdOut or a Html file. |
Description
We used functional decomposition to separate responsibilities:
- CheckerCore shall encapsulate checking logic and HTML parsing/processing.
- All kinds of output (console, HTML file, graphical) shall be handled in a separate component (Reporter).
- The Gradle-specific implementation shall be encapsulated.
Internal Interfaces
Interface | Description |
---|---|
build system | currently restricted to Gradle: The build system uses HtmlSC as configured in the buildscript. |
local-html and local-images | HtmlSC needs access to several local files, especially the html page to be checked and to the corresponding image directories. |
usage via shell | arc42 user uses a command line shell to call the HtmlSC |
Blackboxes Level 1
CheckerCore
Description
Checker contains the core functions to perform the various sanity checks. It parses the html file into a DOM-like in-memory representation, which is then used to perform the actual checks.
Purpose
core: html parsing and sanity checking, file handling
Provided Interfaces
Interface (From-To) | Description |
---|---|
Command Line Interface → Checker | Exposes the AllChecksRunner class, as described in AllChecksRunner. |
Gradle Plugin → Checker | Exposes HtmlSC via a standard Gradle plugin, as described in the Gradle user guide. |
Files
org.aim42.htmlsc.AllChecksRunner
org.aim42.htmlsc.HtmlSanityCheckGradlePlugin
HSC Command Line Interface
Purpose
(not documented)
HSC Gradle Plugin
Purpose
integrates the Gradle build tool with HtmlSC, enabling arbitrary gradle builds to use HtmlSC functionality.
HSC Graphical Interface
Purpose
(planned, not implemented)
Reporter
Purpose
outputs the collected checking results to configurable destinations, e.g. StdOut or a Html file.
Whiteboxes Level 2
CheckerCore - Whitebox
Diagram
Blackboxes
Building Block | Description |
---|---|
[ResultsCollector] | Collects all checking results. Its interface Results is contained in the whitebox description |
AllChecksRunner | Facade to the different Checker instances. Provides a (parameter-driven) command-line interface. |
Checker | abstract class, used in form of the template-pattern. Shall be subclassed for all checking algorithms. |
HtmlParser | Encapsulates html parsing, provides methods to search within the (parsed) html. |
Description
This structure follows a strictly functional decomposition:
- parsing and handling html input
- checking
- collecting checking results
Blackboxes Level 2
[ResultsCollector]
Purpose
Collects all checking results. Its interface Results is contained in the whitebox description
AllChecksRunner
Purpose
Facade to the different Checker instances. Provides a (parameter-driven) command-line interface.
Checker
Description
The abstract Checker provides a uniform interface (public void check()) to different checking algorithms. It is based upon the concept of extensible checking algorithms.
Purpose
abstract class, used in form of the template-pattern. Shall be subclassed for all checking algorithms.
HtmlParser
Purpose
Encapsulates html parsing, provides methods to search within the (parsed) html.
Whiteboxes Level 3
[ResultsCollector] - Whitebox
Diagram
Blackboxes
Building Blocks | Description |
---|---|
Finding | a single finding (e.g. "image 'logo.png' missing"). Can hold suggestions and (planned for future releases) the responsible html element. |
Per-Run Results | results for potentially many Html pages/documents. |
Single-Check-Results | results for a single type of check (e.g. missing-images check) |
Single-Page-Results | results for a single page |
Description
This structure follows the hierarchy of checks, namely managing results for:
- a number of pages/documents, containing:
- a single page, each containing many
- single checks within a page
Internal Interfaces
Interface | Description |
---|---|
Results | The Results interface is used by all clients (especially Reporter subclasses, graphical and command-line clients) to access checking results. It consists of three distinct APIs for overall results (RunResults), single-page results (PageResults) and single-check results (CheckResults). |
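The three result levels could be expressed in Java roughly as follows. This is a minimal sketch: only the names RunResults, PageResults and CheckResults come from the description above, while all method names and the two Simple* helper classes are assumptions made for illustration.

```java
import java.util.List;

/** Results of one type of check on one page. Method names are assumptions. */
interface CheckResults {
    String checkName();
    List<String> findings();
}

/** Results of all checks on a single page. */
interface PageResults {
    String pageName();
    List<CheckResults> checkResults();

    /** Sums the findings of all checks on this page. */
    default int findingCount() {
        return checkResults().stream().mapToInt(c -> c.findings().size()).sum();
    }
}

/** Results of a complete run over one or more pages. */
interface RunResults {
    List<PageResults> pageResults();

    /** Sums the findings of all pages in this run. */
    default int findingCount() {
        return pageResults().stream().mapToInt(PageResults::findingCount).sum();
    }
}

/** Hypothetical value class used only to make the sketch usable. */
final class SimpleCheckResults implements CheckResults {
    private final String checkName;
    private final List<String> findings;

    SimpleCheckResults(String checkName, List<String> findings) {
        this.checkName = checkName;
        this.findings = findings;
    }

    @Override public String checkName() { return checkName; }
    @Override public List<String> findings() { return findings; }
}

/** Hypothetical value class used only to make the sketch usable. */
final class SimplePageResults implements PageResults {
    private final String pageName;
    private final List<CheckResults> checkResults;

    SimplePageResults(String pageName, List<CheckResults> checkResults) {
        this.pageName = pageName;
        this.checkResults = checkResults;
    }

    @Override public String pageName() { return pageName; }
    @Override public List<CheckResults> checkResults() { return checkResults; }
}
```

Keeping the aggregation in default methods means a reporter only needs the interfaces, not the collecting implementation.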
Blackboxes Level 3
Finding
Purpose
a single finding (e.g. "image 'logo.png' missing"). Can hold suggestions and (planned for future releases) the responsible html element.
Per-Run Results
Purpose
results for potentially many Html pages/documents.
Single-Check-Results
Purpose
results for a single type of check (e.g. missing-images check)
Single-Page-Results
Purpose
results for a single page
Runtime View
Deployment View
Context
Elements
Doctype | Node / Artifact | Description |
---|---|---|
node | artifact repository | global public cloud repository for binary artifacts, similar to mavenCentral. HtmlSC binaries are uploaded to this server. |
artifact | build.gradle | Gradle build script configuring (among other things) the HtmlSC plugin to perform the Html checking. |
node | hsc user computer | where documentation is created, with HTML as the output format. |
node | hsc-development | where development of HtmlSC takes place |
artifact | hsc-plugin-binary | Compiled and packaged version of HtmlSC including required dependencies. |
Description
The three nodes (computers) shown in Deployment are connected via public internet.
HtmlSC will:
- be bundled as a single jar,
- be uploaded to the Bintray repository,
- be referenceable within a Gradle build file,
- provide a main method with parameters and options, so all checks can be called from the command line.
Concepts
Domain Model
Diagram
Elements of the Domain
Name | Short Description |
---|---|
Anchor | HTML element to create →Links. Contains link targets in the form <a href="link-target"> |
Cross reference | Link from one part of the document to another part within the same document. A special form of →internal link, with a →link target in the same document. |
external link | Link to another page or resource at another domain. |
Finding | Description of a problem found by one →Checker within the →HTML Page. |
Html Element | HTML pages (documents) are made up of HTML elements, e.g. <a href="link target">, <img src="image.png"> and others. See the W3-Consortium. |
HTML Page | A single chunk of HTML, mostly regarded as a single file. Shall comply with standard HTML syntax. Minimal requirement: Our HTML parser can successfully parse this page. Contains →Html Elements. Also called html document. |
id | Identifier for a specific part of a document, e.g. <h2 id="someHeader">. Often used to describe →link targets. |
internal link | Link to another section of the same page or to another page of the same domain. Also called local link. |
Link | Any reference in the →HTML page that lets you display or activate another part of this document (internal link) or another document, image or resource (can be either →internal (local) or →external link). Every link leads from the link source to the link target. |
Link Target | The target of any →link, e.g. a heading or any other part of an HTML document, or any internal or external resource (identified by URI). Expressed by →id. |
local resource | Local file, either other html files or other filetypes (e.g. pdf, docx). |
Run Result | The overall results of checking a number of pages (at least one page). |
Single Page Result | A collection of all checks of a single →HTML Page. |
URI | Uniform Resource Identifier. Defined in RFC-2396 (since obsoleted by RFC 3986). The ultimate source of truth concerning link syntax and semantics. |
Gradle Plugin Concept and Development
Description
You should definitely read the original Gradle User Guide on custom plugin development.
To enable the required Gradle integration, we implement a lean wrapper as described in the Gradle user guide.
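```groovy
import org.gradle.api.Plugin
import org.gradle.api.Project

class HtmlSanityCheckPlugin implements Plugin<Project> {
    // registers the htmlSanityCheck task in the 'Check' task group
    void apply(Project project) {
        project.task('htmlSanityCheck', type: HtmlSanityCheckTask, group: 'Check')
    }
}
```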
Directory Structure and Required Files
```
|-htmlSanityCheck
| |-src
| | |-main
| | | |-org
| | | | |-aim42
| | | | | |-htmlsanitycheck
| | | | | | | ...
| | | | | | |-HtmlSanityCheckPlugin.groovy (1)
| | | | | | |-HtmlSanityCheckTask.groovy
| | | |-resources
| | | | |-META-INF (2)
| | | | | |-gradle-plugins
| | | | | | |-htmlSanityCheck.properties (3)
| | |-test
| | | |-org
| | | | |-aim42
| | | | | |-htmlsanitycheck
| | | | | | | ...
| | | | | | |-HtmlSanityCheckPluginTest
```
- (1) the actual plugin code: HtmlSanityCheckPlugin and HtmlSanityCheckTask groovy files
- (2) Gradle expects plugin properties in META-INF
- (3) Property file containing the name of the actual implementation class.
Passing Parameters From Buildfile to Plugin
To be done
Building the Plugin
The plugin code itself is built with Gradle.
Uploading to Public Archives
To be done
Further Information on Creating Gradle Plugins
Although writing plugins is described in the Gradle user guide, a clearly explained sample is given in a Code4Reference tutorial.
Flexible Checking Algorithms
Description
HtmlSC achieves flexible checking by defining the skeleton of the checking algorithm in one operation (the template method) and deferring the specific checking steps to subclasses.
The invariant steps are implemented in the abstract base class, while the variant checking algorithms have to be provided by the subclasses.
```groovy
/**
 * template method for performing a single type of checks
 * on the given @see HtmlPage.
 *
 * Prerequisite: pageToCheck has been successfully parsed,
 * prior to constructing this Checker instance.
 */
public CheckingResultsCollector performCheck() {
    // assert prerequisite
    assert pageToCheck != null
    initResults()
    return check() // execute the actual checking algorithm
}
```
Context
Elements
Name | Short Description |
---|---|
Checker | abstract class, used in form of the template-pattern. Shall be subclassed for all checking algorithms. |
 | checks if referenced local image files exist |
 | checks if cross references (links referenced within the page) exist |
 | checks if any id has multiple definitions |
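The same idea, sketched in Java. The method names performCheck and check mirror the Groovy excerpt; everything else (class names, the list of ids standing in for the parsed page, the string findings) is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Holds the invariant steps of every check (hypothetical simplification). */
abstract class CheckerSketch {
    // Stand-in for the parsed page: just the ids found in it.
    protected final List<String> pageIds;
    protected final List<String> findings = new ArrayList<>();

    CheckerSketch(List<String> pageIds) {
        this.pageIds = pageIds;
    }

    /** Template method: fixed sequence, subclasses supply only check(). */
    public final List<String> performCheck() {
        if (pageIds == null) {
            throw new IllegalStateException("page must be parsed before checking");
        }
        findings.clear();                   // plays the role of initResults()
        check();                            // the variant step
        return new ArrayList<>(findings);   // hand out a copy of the results
    }

    /** The actual checking algorithm, provided by each subclass. */
    protected abstract void check();
}

/** One concrete algorithm: reports every id that is defined more than once. */
class DuplicateIdCheckerSketch extends CheckerSketch {
    DuplicateIdCheckerSketch(List<String> pageIds) {
        super(pageIds);
    }

    @Override
    protected void check() {
        Set<String> seen = new HashSet<>();
        for (String id : pageIds) {
            if (!seen.add(id)) {
                findings.add("duplicate id: " + id);
            }
        }
    }
}
```

Adding a new check then means writing one subclass with a check() method; the prerequisite test and result handling are inherited unchanged.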
Flexible Reporting
Description
HtmlSC allows for different output:
- formats (HTML and text) and
- destinations (file and console)
The reporting subsystem uses the template method pattern to allow different output formats (e.g. Console and HTML). The overall structure of reports is always the same.
Graphical clients can use the API of the reporting subsystem to display reports in arbitrary formats.
The (generic and abstract) reporting is implemented in the abstract Reporter class as follows:
```groovy
/**
 * main entry point for reporting - to be called when a report is requested
 * Uses template-method to delegate concrete implementations to subclasses
 */
public void reportFindings() {
    initReport()            // (1)
    reportOverallSummary()  // (2)
    reportAllPages()        // (3)
    closeReport()           // (4)
}

private void reportAllPages() {
    pageResults.each { pageResult ->
        reportPageSummary( pageResult )                    // (5)
        pageResult.singleCheckResults.each { resultForOneCheck ->
            reportSingleCheckSummary( resultForOneCheck )  // (6)
            reportSingleCheckDetails( resultForOneCheck )  // (7)
        }
        reportPageFooter()                                 // (8)
    }
}
```
- (1) initialize the report, e.g. create and open the file, copy CSS, JavaScript and image files.
- (2) create the overall summary, with the overall success percentage and a list of all checked pages with their success rate.
- (3) iterate over all pages
- (4) write the report footer; in the HTML report, also create a back-to-top link
- (5) for a single page, report the number of checks and problems plus the success rate
- (6) for every singleCheck on that page, report a summary and
- (7) all detailed findings for a singleCheck.
- (8) for every checked page, create a footer, page break or similar to graphically distinguish pages from each other.
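The reporting steps above can be sketched in Java as follows. This is a simplified illustration, not the HtmlSC implementation: the class names, the map of page names to findings, and the exact text output are assumptions, and the per-check summary and details steps are collapsed into a single reportFinding step.

```java
import java.util.List;
import java.util.Map;

/** Fixed report structure; output format is left to subclasses (hypothetical names). */
abstract class ReporterSketch {
    // Simplified stand-in for the results API: page name -> findings on that page.
    protected final Map<String, List<String>> findingsPerPage;

    ReporterSketch(Map<String, List<String>> findingsPerPage) {
        this.findingsPerPage = findingsPerPage;
    }

    /** Template method: the order of the steps is the same for every format. */
    public final void reportFindings() {
        initReport();
        reportOverallSummary();
        for (Map.Entry<String, List<String>> page : findingsPerPage.entrySet()) {
            reportPageSummary(page.getKey(), page.getValue().size());
            for (String finding : page.getValue()) {
                reportFinding(finding);
            }
            reportPageFooter();
        }
        closeReport();
    }

    protected abstract void initReport();
    protected abstract void reportOverallSummary();
    protected abstract void reportPageSummary(String page, int findingCount);
    protected abstract void reportFinding(String finding);
    protected abstract void reportPageFooter();
    protected abstract void closeReport();
}

/** Plain-text format; collects the output in memory instead of printing it. */
class TextReporterSketch extends ReporterSketch {
    final StringBuilder out = new StringBuilder();

    TextReporterSketch(Map<String, List<String>> findingsPerPage) {
        super(findingsPerPage);
    }

    @Override protected void initReport() { out.append("HtmlSC report\n"); }

    @Override protected void reportOverallSummary() {
        int total = 0;
        for (List<String> findings : findingsPerPage.values()) {
            total += findings.size();
        }
        out.append(findingsPerPage.size()).append(" page(s), ")
           .append(total).append(" finding(s)\n");
    }

    @Override protected void reportPageSummary(String page, int findingCount) {
        out.append(page).append(": ").append(findingCount).append(" finding(s)\n");
    }

    @Override protected void reportFinding(String finding) {
        out.append("  - ").append(finding).append('\n');
    }

    @Override protected void reportPageFooter() { out.append("----\n"); }

    @Override protected void closeReport() { out.append("end of report\n"); }
}
```

An HTML reporter would override the same six steps and emit markup instead of plain text, without touching the iteration logic.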
Styling the Reporting Output
Description
The HtmlReporter explicitly generates css classes together with the html elements, based upon css styling re-used from the Gradle JUnit plugin.
Stylesheets, a minimized version of the jQuery JavaScript library plus some icons are copied at report-generation time from the jar file to the report output directory.
Styling the back-to-top arrow/button is done as a combination of JavaScript plus some CSS styling, as described in http://www.webtipblog.com/adding-scroll-top-button-website/.
Attributions
Credits for the arrow-icon https://www.iconfinder.com/icons/118743/arrow_up_icon
Design Decisions
HTML Parsing with jsoup
Details
To check HTML we parse it into an internal (DOM-like) representation. For this task we use the jsoup HTML parser, an open-source parser without external dependencies.
To quote from the jsoup website:
"jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods."
Find details on how HtmlSC implements HTML parsing in the HTML encapsulation concept.
Relevance
Check HTML programmatically by using an existing API that provides access and finder methods to the DOM tree of the file(s) to be checked.
Requirements
- few dependencies, so the HtmlSC binary stays as small as possible.
- accessor and finder methods to find images, links and link-targets within the DOM tree.
Alternatives
- HTTPUnit: a testing framework for web applications and websites. Its main focus is web testing and it suffers from a large number of dependencies.
- jsoup: a plain HTML parser without any dependencies (!) and a rich api to access all HTML elements in DOM-like syntax.
Checking of external links postponed
Details
In the current {revision} we won’t check external links. These checks have been postponed to later versions.
String Similarity Checking with Jaro-Winkler-Distance
Details
The small java-string-similarity library (by Ralph Allen Rice) contains implementations of several similarity-calculation algorithms. As it is not available as a public binary, we use the sources instead, primarily:
net.ricecode.similarity.JaroWinklerStrategyTest
net.ricecode.similarity.JaroWinklerStrategy
The actual implementation of the similarity comparison has been postponed to a later release of HtmlSC.
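To illustrate what such a similarity comparison could look like, here is a self-contained Java sketch of the Jaro-Winkler similarity together with a helper that picks the closest candidate for a misspelled link target. It is not the code of the java-string-similarity library: the class and method names are hypothetical, the prefix scaling factor 0.1 and prefix limit 4 are the commonly used defaults, and the boost threshold of Winkler's original formulation is omitted.

```java
import java.util.List;

/** Hypothetical sketch, independent of the java-string-similarity library. */
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]; 1 means identical. */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(len2 - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order (half transpositions).
        int halfTranspositions = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) halfTranspositions++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - halfTranspositions / 2.0) / m) / 3.0;
    }

    /** Jaro similarity boosted by the length of the common prefix (at most 4 characters). */
    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    /** Returns the candidate most similar to target, or null if there are no candidates. */
    static String closest(String target, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = jaroWinkler(target, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

With such a helper, a finding for a broken cross reference could carry the closest existing link target as a suggestion.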