Performs some basic string formatting on the HTML documents generated by pasting Word documents into Dreamweaver to help conform the documents to WCAG/WET standards. It assumes proper indentation (in Dreamweaver, you can generate this with Edit -> Code -> Apply Source Formatting).
HTML page of tool here.
The current formatting checks are as follows (the user can select which to apply). Checks that are generally applicable to all Dreamweaver-pasted documents are checked by default; checks that may only be useful for specific documents are unchecked by default.
List of checks
- replaces the formatting for Dreamweaver-generated footnotes with WET footnotes. This uses helper functions from the footnote_formatter tool. It assumes explicit Dreamweaver formatting for footnote strings; the footnote formatter tool should be used if this is not the case.
- The WET footnote structure is English by default; use the checks under “Translate structure for French documents” to convert it to French.
Remove Dreamweaver-generated links:
- removes empty links, or those that consist only of spaces, since they can’t be easily accessed.
- Example: the string <a href="google.com"></a> is removed.
- Note that this removes Dreamweaver-generated footnotes, which consist of empty links, so you should only use this check if you have already used the above check to fix the footnotes to follow WET formatting (which do not use empty links, and so are not removed).
- removes the referential links that Dreamweaver generates (which don’t properly function).
- Example: the string <a href="#_Ref123">Reference link 1</a> has its link removed, so it is replaced with Reference link 1.
- removes the table of contents links that Dreamweaver generates (which don’t properly function).
- Example: the string <a href="#_Toc12a">Toc link 1</a> has its link removed, so it is replaced with Toc link 1.
- removes the bookmark links that Dreamweaver generates (which don’t properly function).
- Example: the string <a href="#_bookmark1">Bookmark 1</a> has its link removed, so it is replaced with Bookmark 1.
- removes the French logiterms that Dreamweaver generates (which don’t properly function).
- Example: the string <a name="lt_12">Logiterm 1</a> has its link removed, so it is replaced with Logiterm 1.
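As a rough illustration of the link checks above, each one can be expressed as a single regex replace. This is only a sketch: the function names are hypothetical and the patterns are inferred from the examples in this list, so the tool's actual code may differ.

```javascript
// Hypothetical helper names; patterns are inferred from the examples above.

// Remove links that are empty or contain only whitespace.
function removeEmptyLinks(html) {
  return html.replace(/<a\b[^>]*>\s*<\/a>/gi, "");
}

// Unwrap links whose href starts with a Dreamweaver-generated prefix
// (#_Ref, #_Toc, #_bookmark), keeping only the link text.
function unwrapGeneratedLinks(html) {
  return html.replace(
    /<a\b[^>]*\bhref="#_(?:Ref|Toc|bookmark)[^"]*"[^>]*>([\s\S]*?)<\/a>/gi,
    "$1"
  );
}

// Unwrap Logiterm anchors of the form <a name="lt_...">text</a>.
function unwrapLogiterms(html) {
  return html.replace(
    /<a\b[^>]*\bname="lt_[^"]*"[^>]*>([\s\S]*?)<\/a>/gi,
    "$1"
  );
}
```

Running removeEmptyLinks before the footnotes have been converted would also strip Dreamweaver's footnote links, which is why the order of the checks matters.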
Clean up spacing for coding style:
These don’t affect the document’s structural correctness for WET, but it’s helpful to keep your document tidy for visual clarity, string searching, and other coding purposes.
- makes all space encoding consistent by converting invisible non-breaking spaces (where it just looks like a regular space in the editor), tabs, and so on into regular spaces.
- replaces multiple spaces with a single space for neatness. You should apply source formatting in Dreamweaver to fix indentation if you include this check.
- removes spaces at the end of tags.
- Example: replaces the string <p >Extra spacing</p > with <p>Extra spacing</p>.
- removes empty attribute-less tags, and replaces all attribute-less tags that consist solely of spaces or &nbsp; with a single regular space. This excludes br and td tags.
- Example: the string <a> </a> has its tag removed, so it is replaced with a single space “ ”.
- removes extra spaces after opening p, li, th, td, and header (h1, h2 etc.) tags for neatness.
- Example: replaces the string <li> list item 1</li> with <li>list item 1</li>.
- removes extra spaces before closing p, li, th, td, and header (h1, h2 etc.) tags for neatness.
- Example: replaces the string <li>list item 1 </li> with <li>list item 1</li>.
- replaces the HTML entities for straight quotes apos, quot, #39, and #34 with their actual values.
- Example: replaces &quot; &apos;quotes&apos; &quot; with " 'quotes' ".
- replaces fancy (slanted) quotes with regular quotes.
- Example: replaces ‘a’ “b” with 'a' "b".
- Unchecked by default.
- replaces the HTML entities for fancy quotes rsquo, lsquo, rdquo, ldquo, #8216, #8217, #8220, and #8221 with regular quotes.
- Example: replaces &ldquo; &lsquo;Fancy quotes&rsquo; &rdquo; with " 'Fancy quotes' ".
- Unchecked by default.
- replaces the HTML entities for fancy quotes rsquo, lsquo, rdquo, ldquo, #8216, #8217, #8220, and #8221 with their actual values.
- Example: replaces &ldquo; &lsquo;Fancy quotes&rsquo; &rdquo; with “ ‘Fancy quotes’ ”.
- Unchecked by default.
- replaces Word’s em dashes with regular dashes.
- Example: replaces the string – with -.
- joins consecutive em tags and consecutive strong tags, with only spaces/newlines between them, into a single tag.
- Example 1: replaces the string <em>italics 1</em> <em>italics 2</em> with <em>italics 1 italics 2</em>.
- Example 2: replaces the string <strong>bold 1</strong><strong>bold 2</strong> with <strong>bold 1bold 2</strong>.
- joins consecutive em tags and consecutive strong tags, with only non-alphanumeric characters between them, into a single tag.
- Also covers Examples 1 and 2 from the above check.
- Example 3: replaces the string <em>italics 1</em>, <em>italics 2</em> with <em>italics 1, italics 2</em>.
- Unchecked by default.
- joins consecutive ul tags into a single list.
- joins consecutive ol tags into a single list.
- removes br tags at the start or end of p, li, td, th, and header (h1, h2 etc.) tags.
- splits up blocks of text that are separated from each other by br within a single <p> tag, by moving each block into its own <p> tag.
- splits up blocks of text that are separated from each other by br within a single <p> tag, by moving each block into its own <p> tag, if the chunk before the br ends in one of the following punctuation symbols: . , ; : ! ? ) “ ’ ”
- changes all <br> and <br/> to <br />.
- removes em and strong that consist only of spaces/newlines and the above punctuation symbols.
- Example: replaces the string <em>, </em> with “, ”.
- replaces common spacing and duplication misformatting for punctuation:
- “ .” and “..” with “.”
- “ ,” and “,,” with “,”
- “ ;” and “;;” with “;”
- “: :” and “::” with “:”
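A few of the spacing checks above can be sketched as small regex helpers. The names below are hypothetical, and treating "alphanumeric" as ASCII letters and digits is an assumption; the tool itself may split these into more steps or use different patterns.

```javascript
// Hypothetical helpers illustrating some of the spacing checks.

// Convert non-breaking spaces and tabs to regular spaces, then collapse runs.
// This also flattens indentation, so source formatting should be re-applied.
function normalizeSpaces(html) {
  return html.replace(/[\u00a0\t]/g, " ").replace(/ {2,}/g, " ");
}

// Remove spaces just before the closing ">" of a tag: <p > becomes <p>.
function trimTagEnds(html) {
  return html.replace(/<([^<>]*?) +>/g, "<$1>");
}

// Join consecutive em or strong tags separated only by spaces/newlines.
function joinConsecutiveTags(html) {
  return html.replace(/<\/(em|strong)>(\s*)<\1>/g, "$2");
}

// Variant that also joins across punctuation. Assumption: "alphanumeric"
// means ASCII letters and digits here.
function joinAcrossPunctuation(html) {
  return html.replace(/<\/(em|strong)>([^a-zA-Z0-9<>]*)<\1>/g, "$2");
}
```

The backreference \1 ensures that a closing em is only joined with an opening em, and likewise for strong.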
These checks take precedence in the given order. For example, if both “change em tags to cite tags on lines that have links” and “change all em tags to i tags” are checked, then the first check will take precedence. So on lines that have links, em tags will be changed to cite tags, and on other lines without links, em tags will be changed to i tags.
- changes em tags to cite tags on lines that have links. If a line contains the “a” tag, then all em tags on the same line are changed to cite tags.
- Example 1: the line <p><a>link</a><em>cite</em></p> is changed to <p><a>link</a><cite>cite</cite></p>.
- Example 2: the line <p><em>non-cite</em></p> is left unchanged.
- changes all em tags to cite tags.
- Example: replaces the string <em>italics</em> with <cite>italics</cite>.
- Unchecked by default.
- changes all em tags to span class="osfi-txt--italic" tags.
- Example: replaces the string <em>italics</em> with <span class="osfi-txt--italic">italics</span>.
- Unchecked by default.
- changes all em tags to i tags.
- Example: replaces the string <em>italics</em> with <i>italics</i>.
- Unchecked by default.
- changes all strong tags to span class="osfi-txt--bold" tags.
- Example: replaces the string <strong>bold</strong> with <span class="osfi-txt--bold">bold</span>.
- Unchecked by default.
- changes all strong tags to b tags.
- Example: replaces the string <strong>bold</strong> with <b>bold</b>.
- Unchecked by default.
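The precedence rule described at the top of this list can be illustrated with a short sketch. The function names and the shape of the `checks` object are assumptions made for the example, not the tool's actual interface.

```javascript
// Hypothetical sketch: checks run in the listed order, so an earlier check
// consumes the em tags before a later one can see them.

// Change em to cite, but only on lines that contain an "a" tag.
function emToCiteOnLinkLines(html) {
  return html
    .split("\n")
    .map((line) =>
      /<a\b/i.test(line) ? line.replace(/<(\/?)em>/g, "<$1cite>") : line
    )
    .join("\n");
}

// Change every remaining em tag to an i tag.
function emToI(html) {
  return html.replace(/<(\/?)em>/g, "<$1i>");
}

// `checks` maps assumed check names to booleans.
function applyTagChecks(html, checks) {
  let out = html;
  if (checks.citeOnLinkLines) out = emToCiteOnLinkLines(out);
  if (checks.emToI) out = emToI(out);
  return out;
}
```

With both checks enabled, a line with a link ends up with cite tags and a line without one ends up with i tags, because no em tags are left on the link lines by the time the second check runs.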
Add/fix/remove tag attributes:
- replaces Dreamweaver-generated center and right alignment (looks like align="center" or align="right") with their respective WET classes.
- Example: replaces the string <li align="center">center align</li> with <li class="text-center">center align</li>.
- ensures that internal links to the OSFI website are relative by removing the OSFI main page from the URL, and adds rel="external" to other links. This ignores links that have keywords indicating footnotes (“ftn” and “fnb”), table of contents (“_Toc” and “toc”), or email addresses (“mailto” and “@”). It also ignores already existing internal links (links that begin with “/Eng/” or “/Fra/”).
- Example 1: replaces the string <a href="osfi-bsif.gc.ca/Eng/test">internal link 1</a> with <a href="/Eng/test">internal link 1</a>.
- Example 2: replaces the string <a href="https://www.google.ca/">external link</a> with <a rel="external" href="https://www.google.ca/">external link</a>.
- Example 3: the string <a href="_ftn">not external link</a> is left unchanged because the string “_ftn” indicates a footnote.
- Example 4: the string <a href="/Eng/test">internal link 1</a> is left unchanged because it is already an internal link.
- removes attributes from p (paragraph) tags.
- Example: replaces the string <p test="x">p attribute</p> with <p>p attribute</p>.
- This should be unchecked if the input document is already WET formatted, as the attributes may have been manually inserted.
- removes attributes from ol and ul tags.
- Example: replaces the string <ol test="x"> with <ol>.
- This should be unchecked if the input document is already WET formatted, as the attributes may have been manually inserted.
- removes attributes from table, th, and tr tags, and removes attributes from td tags except for colspan and rowspan.
- Example 1: replaces the string <table border="1"> with <table>.
- Example 2: replaces the string <td width="97" rowspan="1" colspan = "1">table 1,2</td> with <td rowspan="1" colspan = "1">table 1,2</td>.
- This should be unchecked if the input document is already WET formatted, as the attributes may have been manually inserted.
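The internal/external link check in this group involves the most branching, so here is a minimal sketch of its decision order. The function name, the exact form of the OSFI prefix pattern, and the skip for links that already carry a rel attribute are assumptions based on the examples above.

```javascript
// Keywords that mark a link as a footnote, table of contents entry, or email.
const IGNORED_KEYWORDS = ["ftn", "fnb", "_Toc", "toc", "mailto", "@"];

// Assumed form of the OSFI main page prefix, based on Example 1.
const OSFI_PREFIX = /^(?:https?:\/\/)?(?:www\.)?osfi-bsif\.gc\.ca(?=\/)/i;

// Hypothetical helper: rewrites the opening <a ...> tag of each link.
function normalizeLinks(html) {
  return html.replace(
    /<a\b([^>]*?)href="([^"]*)"([^>]*)>/gi,
    (tag, before, href, after) => {
      // Footnote, table of contents, and email links are left alone.
      if (IGNORED_KEYWORDS.some((k) => href.includes(k))) return tag;
      // Already-relative internal links are left alone.
      if (/^\/(Eng|Fra)\//.test(href)) return tag;
      // OSFI links become relative by dropping the main page prefix.
      if (OSFI_PREFIX.test(href)) {
        return `<a${before}href="${href.replace(OSFI_PREFIX, "")}"${after}>`;
      }
      // Assumption: a link that already has a rel attribute is not touched.
      if (/\brel=/.test(before + after)) return tag;
      // Everything else is treated as external.
      return `<a rel="external"${before}href="${href}"${after}>`;
    }
  );
}
```

Because the keyword test is a plain substring match, any URL that happens to contain one of the keywords is skipped as well.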
Translate structure for French documents:
- translates internal links to French by searching for /Eng/ and replacing with /Fra/, and translates the English WET footnote structure created by the earlier footnote check to French.
- Example 1: replaces the string <a href="/Eng/test">internal link 1</a> with <a href="/Fra/test">internal link 1</a>.
- Example 2: replaces the string Return to footnote with Retour à la référence de la note de bas de page.
- Only checked by default for French documents.
- replaces 1er with 1<sup>er</sup>, and other French list numberings formatted as #e with #<sup>e</sup>.
- Example: replaces the string 1er 2e with 1<sup>er</sup> 2<sup>e</sup>.
- Only checked by default for French documents.
- ensures that “%”, “:”, and “$” symbols have an &nbsp; in front of them (for French spacing around those punctuation symbols).
- Example: replaces the string dollar $ percent% with dollar&nbsp;$ percent&nbsp;%.
- Only checked by default for French documents.
- changes strings indicating superscript and subscript tags in the original Word document to be actual tags, and joins these tags together.
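- Example 1: replaces the string Regular&lt;sup&gt;s&lt;/sup&gt;&lt;sup&gt;u&lt;/sup&gt;&lt;sup&gt;p&lt;/sup&gt;&lt;sup&gt;e&lt;/sup&gt;&lt;sup&gt;r&lt;/sup&gt;&lt;sup&gt;s&lt;/sup&gt;&lt;sup&gt;c&lt;/sup&gt;&lt;sup&gt;r&lt;/sup&gt;&lt;sup&gt;i&lt;/sup&gt;&lt;sup&gt;p&lt;/sup&gt;&lt;sup&gt;t&lt;/sup&gt; with Regular<sup>superscript</sup>.
- Example 2: replaces the string Regular&lt;sub&gt;s&lt;/sub&gt;&lt;sub&gt;u&lt;/sub&gt;&lt;sub&gt;b&lt;/sub&gt;&lt;sub&gt;s&lt;/sub&gt;&lt;sub&gt;c&lt;/sub&gt;&lt;sub&gt;r&lt;/sub&gt;&lt;sub&gt;i&lt;/sub&gt;&lt;sub&gt;p&lt;/sub&gt;&lt;sub&gt;t&lt;/sub&gt; with Regular<sub>subscript</sub>.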
- changes strings indicating math tags in the original Word document to be actual tags.
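The French ordinal and spacing checks in this group can be sketched as follows. The helper names are hypothetical, and this version assumes it is given a plain text fragment: applied to raw HTML it would also touch the colons inside URLs and attributes, which the real tool presumably has to guard against.

```javascript
// Hypothetical helpers for the French-specific text checks.

// 1er becomes 1<sup>er</sup>; other numbers followed by "e" become #<sup>e</sup>.
function superscriptOrdinals(text) {
  return text
    .replace(/\b1er\b/g, "1<sup>er</sup>")
    .replace(/\b(\d+)e\b/g, "$1<sup>e</sup>");
}

// Ensure %, : and $ are preceded by &nbsp;, replacing a regular space or an
// existing non-breaking space if one is already there.
// Assumption: the input contains no tags or URLs.
function frenchPunctuationSpacing(text) {
  return text.replace(/(?:&nbsp;|\u00a0| )?([%:$])/g, "&nbsp;$1");
}
```

The spacing helper is idempotent: running it on text that already has &nbsp; before the symbol leaves the text unchanged.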
I sometimes split regex statements up into multiple calls for clarity, but a lot of these checks can be done in one or two regex statements.
This is not intended to perform the complete WET formatting process and only covers the basic initial steps; further manual adjustments to the document may be required to make it fully WET valid.
Superscripts and subscripts
For the second last check, the tool looks for the following strings, which indicate fake tags for superscripts/subscripts:
- &lt;sup&gt;, which displays as <sup>
- &lt;sub&gt;, which displays as <sub>
and converts them to actual tags (so that we have a sup or sub tag at that location instead).
This check is to be used in conjunction with some manual find-and-replace in the original Word document before pasting into Dreamweaver. When pasting a Word document into Dreamweaver, we turn off including styles because of the unnecessary CSS bloat it adds (you can turn including styles on or off in the Dreamweaver preferences; it is off by default). However, superscripts and subscripts count as styles to Dreamweaver, meaning they get copied over as regular text, and you have to manually insert them into Dreamweaver’s generated HTML document. The easiest way to do this is to mark where these superscripts and subscripts are in the Word document before pasting.
For this tool, you should mark superscripts and subscripts in the Word document with <sup>, </sup>, <sub>, and </sub>, using the process described below.
This tool looks for strings indicating subscript/superscript tags in the html document (where the angle brackets <> have been converted to their html entities by Dreamweaver) and changes them to be actual tags. Afterwards, it joins consecutive sup and sub tags.
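A minimal sketch of this two-step conversion, assuming the fake tags arrive exactly as &lt;sup&gt;, &lt;/sup&gt;, &lt;sub&gt;, and &lt;/sub&gt; (the function name is hypothetical):

```javascript
// Hypothetical helper: turn entity-encoded sup/sub markers into real tags,
// then merge directly adjacent tags of the same kind.
function restoreSupSubTags(html) {
  return html
    // &lt;sup&gt; becomes <sup>, &lt;/sub&gt; becomes </sub>, and so on.
    .replace(/&lt;(\/?)(sup|sub)&gt;/g, "<$1$2>")
    // Drop each "</sup><sup>" (or "</sub><sub>") pair so neighbours merge.
    .replace(/<\/(sup|sub)><\1>/g, "");
}
```

The join step is needed because the Word wildcard search described below wraps each superscript character individually.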
Steps to mark superscripts and subscripts in Word
- Open the “replace” box in Word (ctrl+h).
- Select “More »” for additional options.
- Check “Use wildcards”. This is a pattern searcher used by Word that is similar to regex.
- In the “Find what” box, enter ([!(^2)]). This is equivalent to ([^(^2)]) in regex, where ^2 is the character for a footnote/endnote in Word. In other words, we are searching for superscripts that aren’t footnotes/endnotes.
- While in the “Find what” box, select “Format”, then “Font”.
- Only check the “Superscript” box. Leave the “Subscript” box unchecked; all other boxes should be filled in, but not checked.
- Click “OK”. You are now searching for superscript text.
- In the “Replace with” box, enter <sup>\1</sup>. This is equivalent to <sup>$1</sup> in regex.
- While in the “Replace with” box, select “Format”, then “Font”.
- Only check the “Superscript” box. Leave the “Subscript” box unchecked; all other boxes should be filled in, but not checked.
- Click “OK”. You are now replacing superscript text.
- Click “Replace All” to surround all superscripts with <sup> and </sup>.
Afterwards, repeat these steps for subscripts: check “Subscript” instead of “Superscript” in both Font dialogs, and enter <sub>\1</sub> in the “Replace with” box instead.
MathML
For the last check, the tool looks for the following fake tags:
- <*math.*?>
- <*mi.*?>
- <*mo.*?>
- <*mn.*?>
and replaces them with actual tags.
As with superscripts and subscripts, you can’t copy a Word document with equations into Dreamweaver and have the equations formatted properly (assuming you want them formatted as MathML). Since you need to copy the equations one by one to have them written out as MathML instead of the default linear format, this is best done with a macro, which can be found here. The linked macro replaces equations with their MathML code; once the Word document is pasted into Dreamweaver, this tool fixes the tags.
Adding checks
The steps to add a check that follows the tool’s current formatting/organization are as follows:
- In dw_paste_format.html: Add the new check into the form where the other checks are located. Different groups of checks are separated by two <br /> instead of one; put the check in whichever group you think makes the most sense.
- In dw_paste_format_helpers.js:
- In set_default_checks(), set when it should and shouldn’t be a default check, based on how safe/useful it is, for English Word pastes, French Word pastes, English WET-formatted documents, and French WET-formatted documents.
- In format_file(), create an if statement for the new check, positioned at the same place as where you put it in the HTML document. For the sake of consistency, put the logic for the check in a helper function even if it’s only one line.
- Create the helper function for the logic below format_file(), positioned at the same place as where you put it in the HTML document.
- In README.md (this file): Add a description of the new check in the first section, positioned at the same place as where you put it in the HTML document.
- In sample_page.html: Add some text to test the new check with.
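As an illustration of the format_file() and helper-function steps, here is a sketch for a made-up check that removes HTML comments. Everything in it is an assumption for the sake of the example: the check name remove_html_comments is hypothetical, and the real format_file() may take different arguments and read the checkbox states differently.

```javascript
// Assumed signature: the document text plus an object mapping check ids to
// booleans. The tool's actual format_file() may differ.
function format_file(html, checks) {
  let out = html;
  // ... existing checks, in the same order as the HTML form ...
  if (checks.remove_html_comments) {
    out = remove_html_comments(out);
  }
  // ... remaining checks ...
  return out;
}

// Helper for the hypothetical new check, kept in its own function even
// though the logic is a single line.
function remove_html_comments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}
```

Keeping the helper separate, even for one line, keeps format_file() readable as a list of checks in form order.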