gen_dw_format

Dreamweaver Table of Contents Formatting

Formats table of contents tables generated by pasting Word documents into Dreamweaver, and adds their links to the main body following WET standards.

You should run the input document through the general Dreamweaver paste formatting tool first to reduce errors resulting from malformed documents.

I’m considering integrating this with the Dreamweaver paste formatting tool, but I haven’t yet: unlike the fairly safe checks in that tool, this one is likely to introduce HTML structural errors. The code is also considerably more involved than the other checks.

HTML page of tool here.

Different Word documents produce very different content structures when pasted into Dreamweaver. I’ve gone through a few different cases and worked out some patterns, but because of this inconsistency the tool may not be reliable in general. Be sure to double-check the output and click through the table of contents links to confirm that they work.

Inputs

  1. How the Dreamweaver table of contents is structured, i.e. which entry separators are used. The following have been implemented so far:
    • All entries are placed into a single <p> tag, and each entry is separated by <br>, <br/>, or <br />, e.g.
        <p>Entry 1 <br>
          Entry 2 <br>
          Entry 3 </p>
      
    • All entries are placed into a single <ul> tag, and each entry is its own <li> element, e.g.
        <ul>
          <li>Entry 1</li>
          <li>Entry 2</li>
          <li>Entry 3</li>
        </ul>
      
  2. Inputs for where the table of contents is located in the document.
    • This can optionally consist of two inputs explicitly indicating which line the table of contents starts on, and which it ends on.
      • These should start at index 0, i.e. the first line in the document is 0, the second line is 1, and so on.
      • These should include the entire encompassing structure of the Dreamweaver-formatted table of contents, including the div tag that Dreamweaver usually puts the table into. Everything between these two lines, inclusive, will be removed and replaced with a WET-formatted div for the updated table of contents, so include all surrounding lines that you do not want to appear in the cleaned document.
    • From what I have seen, Dreamweaver usually surrounds the table of contents with two lines that consist of <br clear="all">. If the start/end lines are not provided, the tool searches for a block of text between two such lines that contains at least two entry separators (described in the first input).
  3. The initial string for a table of contents entry ID (to differentiate it from regular links).
    • By default, this is “toc_”. For conciseness, the rest of the documentation will assume the default value.
      • This means that, for example, an entry ID link may be formatted as “#toc_1”.
  4. How the tool should attempt to indent the output WET-formatted table of contents, decide header tag level, and generate IDs. The options are as follows:
    • use existing list numberings from the document (explanation in this section).
    • use manually inserted list numberings (differences from using existing list numberings explained in this section).
    • use existing indentation in the table. This is only helpful if the Dreamweaver table is formatted as a list, as described in the first input.
  5. The list type for the output WET-formatted table of contents. For example, if <ol class="lst-num"> is selected, then the WET table of contents might be formatted like this:
    <ol class="lst-num">
        <li>Entry 1
            <ol class="lst-num">
                <li>Entry 2</li>
                <li>Entry 3</li>
            </ol>
        </li>
    </ol>
    
  6. The option for whether to remove table of contents page numbers from entries. When searching for page numbers, the tool expects them to be at the end of the table of contents entry, and to use one of the following two formats. If the page numbers follow a different format, then you should remove them from the document manually before running the tool.
    • at least two periods, followed by any number of spaces, followed by a number, such as the following:
      • <li>Entry........ 5</li>
    • at least one period, followed by at least one space, followed by a number, such as the following:
      • <li>Entry . 3</li>
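
As a rough sketch of the page number check in input 6 (not the tool’s actual code), the two formats can be folded into one regular expression. The function name removePageNumber is hypothetical, and the sketch assumes it receives the entry’s text with the surrounding <li> or <br> markup already removed.

```javascript
// Hypothetical helper: strips a trailing page number from a ToC entry's text.
// Assumes the surrounding <li>/<br> markup has already been removed.
function removePageNumber(entryText) {
  // Format 1: two or more periods, optional spaces, digits ("Entry........ 5").
  // Format 2: one or more periods, at least one space, digits ("Entry . 3").
  const pageNumberPattern = /\s*(?:\.{2,}\s*|\.+\s+)\d+\s*$/;
  return entryText.replace(pageNumberPattern, "");
}

console.log(removePageNumber("Entry........ 5")); // "Entry"
console.log(removePageNumber("Entry . 3"));       // "Entry"
console.log(removePageNumber("3.1 Overview"));    // unchanged: "3.1 Overview"
```

Anything that does not fit either pattern, such as a page number separated only by spaces, is left untouched, which is why such entries need to be cleaned by hand first.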

Details

This tool has the following three functionalities:

  1. It formats the table containing the table of contents to fit WET standards. This includes adding in the surrounding div.
  2. It removes formatting tags (bold, italics, and so on) in the table of contents entries.
  3. It looks for tags/lines in the document with the same values as the entries in the table of contents, and converts them to header tags containing IDs to be linked to by the table of contents.

Individual entries in a WET table of contents table use formatting similar to this example:

<li><a href="#toc_3.1">3.1 Overview</a></li>

This entry should represent a header in the main document, which should then be linked to in the table of contents entry by ID. For the above entry, there would be the following header later in the document:

<h3 id="toc_3.1">3.1 Overview</h3>

Notice how the table of contents entry contains the link “#toc_3.1”, which links to this header with an ID of “toc_3.1”.

Misformatting introduced in functionality 3

False positives

Functionality 3 (where tags/lines in the main body of the document are converted to headers that are linked to by table of contents entries) is very likely to produce false positives, because it replaces every tag and line (except <li> and <td> tags) that has the same value as a table of contents entry. For example, if a table of contents entry contains “3.1 Overview” and several paragraphs later in the document consist solely of “3.1 Overview” or “Overview”, all of them will be replaced, even though only one of them can be the actual header.

These false positives will have to be manually fixed afterward. Since the IDs will be duplicated as well, this will show up as an error in the HTML structure (IDs have to be unique), which should make them easier to locate in Dreamweaver.

Above every tag/line that is replaced, the tool adds a comment containing the original value of the line. This helps with figuring out whether the replaced line was a false positive.

<li> and <td> tags in particular are ignored by the tool in this step because headers are usually not formatted as list or table data items, so those tags are almost always false positives.

Poorly structured HTML

In addition, since the entire tag or line is replaced, the resulting HTML may not be well structured. For example, it may find the following lines:

<p>Overview<br>
  &nbsp;</p>

and only replace the first line, resulting in this:

<!-- Original tag: <p>Overview<br> -->
<h3 id="toc_3.1">3.1 Overview</h3>
  &nbsp;</p>

which contains an extra closing p tag. You will have to go through and fix any errors with the HTML structure yourself afterwards; the comments containing the original values above the lines that have been replaced should help with this as well.
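
The replacement behaviour described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool’s implementation: the function name replaceMatchingLines and the entry object’s property names (numbering, text, linkId, level) are hypothetical, and a line is compared by stripping its tags rather than by the tool’s real regex.

```javascript
// Hypothetical sketch: replace body lines whose visible text equals a ToC entry.
// entry = { numbering, text, linkId, level } is an assumed shape.
function replaceMatchingLines(lines, entry) {
  const label = `${entry.numbering} ${entry.text}`.trim();
  const header = `<h${entry.level} id="${entry.linkId}">${label}</h${entry.level}>`;
  const output = [];
  for (const line of lines) {
    const trimmed = line.trim();
    // <li> and <td> lines are skipped, as described above.
    const isSkipped = /^<(li|td)\b/i.test(trimmed);
    // Drop all tags to get the text a reader would see.
    const visibleText = trimmed.replace(/<[^>]*>/g, "").trim();
    const matches = visibleText === entry.text || visibleText === label;
    if (!isSkipped && matches) {
      // Keep the original line in a comment so false positives can be traced.
      output.push(`<!-- Original tag: ${trimmed} -->`);
      output.push(header);
    } else {
      output.push(line);
    }
  }
  return output;
}

const body = ["<p>Overview<br>", "  &nbsp;</p>", "<li>Overview</li>", "<p>3.1 Overview</p>"];
const entry = { numbering: "3.1", text: "Overview", linkId: "toc_3.1", level: 3 };
console.log(replaceMatchingLines(body, entry).join("\n"));
```

Running this on the sample lines shows both problems at once: the stray `&nbsp;</p>` line is left behind, and the second match produces a duplicate `toc_3.1` ID.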

Removing tags inside table of contents entries

When cleaning each table of contents entry, the following tags are removed because they are usually introduced unnecessarily when pasting a document into Dreamweaver, and/or may produce formatting errors if kept.
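
As a sketch of this cleaning step, assuming only the bold and italic tags mentioned earlier (the tool’s full tag list may differ, and the names below are hypothetical):

```javascript
// Assumed tag set for illustration only; the tool's real list may be longer.
const ASSUMED_FORMATTING_TAGS = ["b", "strong", "i", "em"];

// Hypothetical helper: removes opening and closing formatting tags but keeps their text.
function stripFormattingTags(entryHtml, tags = ASSUMED_FORMATTING_TAGS) {
  // Matches <tag>, </tag>, and <tag attr="..."> for each listed tag name only.
  const pattern = new RegExp(`</?(?:${tags.join("|")})(?:\\s[^>]*)?>`, "gi");
  return entryHtml.replace(pattern, "");
}

console.log(stripFormattingTags("<strong>3.1 <em>Overview</em></strong>")); // "3.1 Overview"
```

Because the pattern requires the tag name to be followed by whitespace or `>`, tags such as `<br>` or `<img>` are not caught by the `b` and `i` alternatives.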

Lines for the table of contents

I have noticed that Dreamweaver formats its table of contents either as a p tag separated by br, or a ul list (corresponding to the options for the first input of this tool).

If this assumption is incorrect, then the tool will not work.
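
To illustrate the two structures, here is a minimal sketch of how entries could be pulled out of each one. The function name splitEntries and the structure values "paragraph" and "list" are hypothetical, and the list branch only handles a flat <ul> with no nested sublists.

```javascript
// Hypothetical helper: returns the raw entry strings from a Dreamweaver ToC block.
function splitEntries(tocHtml, structure) {
  if (structure === "list") {
    // Flat <ul>: each <li>...</li> is one entry (nested lists are not handled here).
    return [...tocHtml.matchAll(/<li(?:\s[^>]*)?>([\s\S]*?)<\/li>/gi)].map((m) => m[1].trim());
  }
  // Single <p>: drop the wrapping <p> tags, then split on <br>, <br/>, or <br />.
  return tocHtml
    .replace(/<\/?p(?:\s[^>]*)?>/gi, "")
    .split(/<br\s*\/?>/i)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

console.log(splitEntries("<p>Entry 1 <br>\n  Entry 2 <br />\n  Entry 3 </p>", "paragraph"));
console.log(splitEntries("<ul>\n  <li>Entry 1</li>\n  <li>Entry 2</li>\n</ul>", "list"));
```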

As mentioned earlier, I have noticed that the table of contents is usually surrounded by two lines that consist of <br clear="all">. So if no inputs are provided for the start/end line positions of the table of contents in the HTML document, then the tool searches for a block of text between two <br clear="all"> lines that contains at least two entry separators (for example, <br>), and uses that block of text as the table of contents.

If this does not properly find the lines of the HTML document that consist of the table of contents, then you should manually enter the start/end line positions instead.
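
The automatic search can be sketched roughly as below. This is an illustration under assumptions rather than the tool’s code: findTocBlock is a hypothetical name, the separator pattern is passed in by the caller, and the sketch simply returns the zero-based indexes of the two <br clear="all"> lines without deciding which surrounding lines get removed.

```javascript
// Hypothetical helper: finds the first block between two <br clear="all"> lines
// that contains at least two entry separators.
function findTocBlock(lines, separatorPattern = /<br\s*\/?>/gi) {
  // Collect the zero-based indexes of lines consisting only of the marker.
  const markers = [];
  lines.forEach((line, index) => {
    if (line.trim() === '<br clear="all">') markers.push(index);
  });
  // Check each pair of consecutive markers.
  for (let m = 0; m + 1 < markers.length; m++) {
    const start = markers[m];
    const end = markers[m + 1];
    const block = lines.slice(start + 1, end).join("\n");
    // separatorPattern must carry the "g" flag so match() returns every hit.
    const separatorCount = (block.match(separatorPattern) || []).length;
    if (separatorCount >= 2) return { start, end };
  }
  return null; // nothing found: fall back to manual start/end lines
}

const sample = [
  "<p>Intro</p>",
  '<br clear="all">',
  "<div>",
  "<p>Entry 1 <br>",
  "  Entry 2 <br>",
  "  Entry 3 </p>",
  "</div>",
  '<br clear="all">',
  "<p>Body</p>"
];
console.log(findTocBlock(sample)); // { start: 1, end: 7 }
```

For the <ul> structure, a caller could pass a pattern such as `/<li[\s>]/gi` instead of the default.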

Header IDs

If the option for list numbering is selected:

If the option for list numbering is not selected, then header IDs are all formatted as “toc_” followed by the internal counter, e.g. “toc_1”.

In either case, the internal counter increments at each table of contents entry that uses it, to keep the IDs unique.
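
A small sketch of how such IDs might be generated. The counter behaviour follows the description above; the numbered case (“toc_” plus the list numbering) is an assumption inferred from the earlier “toc_3.1” example, and makeIdGenerator is a hypothetical name.

```javascript
// Hypothetical ID generator; prefix defaults to the documented "toc_".
function makeIdGenerator(prefix = "toc_") {
  let counter = 0; // internal counter, only advanced by entries that use it
  return function nextId(listNumbering) {
    if (listNumbering) {
      // Assumption: numbered entries reuse their numbering, e.g. "toc_3.1".
      return prefix + listNumbering;
    }
    counter += 1;
    return prefix + counter;
  };
}

const nextId = makeIdGenerator();
console.log(nextId("3.1")); // "toc_3.1"
console.log(nextId(""));    // "toc_1"
console.log(nextId(""));    // "toc_2"
```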

List numbering

Most of the tool’s usefulness comes from its attempt to guess the hierarchy of the table of contents, which it does using list numberings at the start of each entry. List numberings must be formatted with numbers and periods. The hierarchy of the table of contents is used for two things: indentation and header level.

For indentation, each entry’s level is compared to the previous entry’s level: a sublist is created if the entry is deeper in the hierarchy, and the previous sublist is closed if it is shallower.
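
The nesting step can be sketched as below, assuming each entry already carries a numeric level and the HTML to place inside its <li>. The function name buildNestedList and the entry shape are hypothetical. The sketch only compares each entry with the list that is currently open, so it does not handle jumps of more than one level well.

```javascript
// Hypothetical helper: turns [{ level, html }] into a nested WET-style list.
function buildNestedList(entries, listOpen = '<ol class="lst-num">', listClose = "</ol>") {
  if (entries.length === 0) return "";
  let out = "";
  const openLevels = []; // levels of the lists that are currently open
  for (const entry of entries) {
    const current = openLevels[openLevels.length - 1];
    if (openLevels.length === 0 || entry.level > current) {
      // First entry, or a deeper one: open a (sub)list inside the still-open <li>.
      out += listOpen;
      openLevels.push(entry.level);
    } else {
      // Same level or shallower: close the previous <li> first.
      out += "</li>";
      // Close sublists until we are back at this entry's level.
      while (openLevels.length > 1 && entry.level < openLevels[openLevels.length - 1]) {
        out += listClose + "</li>";
        openLevels.pop();
      }
    }
    out += "<li>" + entry.html;
  }
  // Close the last <li> and every list that is still open.
  out += "</li>";
  while (openLevels.length > 1) {
    out += listClose + "</li>";
    openLevels.pop();
  }
  return out + listClose;
}

console.log(buildNestedList([
  { level: 2, html: "Entry 1" },
  { level: 3, html: "Entry 2" },
  { level: 3, html: "Entry 3" }
]));
```

With the three sample entries this yields the same nesting as the <ol class="lst-num"> example in the Inputs section, just without the whitespace.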

For the header level, the tool checks how many times a period followed by a number appears in the list numbering. The lowest level is 2, for h2. For example:
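
A minimal sketch of this rule, assuming the list numbering has already been extracted as a string such as “3.1” (the function name headerLevel is hypothetical):

```javascript
// Hypothetical helper: header level = 2 + number of ".<digits>" groups in the numbering.
function headerLevel(listNumbering) {
  const subLevels = (listNumbering.match(/\.\d+/g) || []).length;
  return 2 + subLevels;
}

console.log(headerLevel("3"));     // 2, i.e. <h2>
console.log(headerLevel("3.1"));   // 3, i.e. <h3>
console.log(headerLevel("3.1.2")); // 4, i.e. <h4>
```

This agrees with the earlier “3.1 Overview” entry, which becomes an h3. A trailing period with no number after it, as in “3.”, does not add a level.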

Any initial list numberings are set to be optional in step 3’s regex statement that searches for tags/lines consisting of table of contents entries. So if a table of contents entry consists of “Overview”, both of the following tags would match:
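
A sketch of such a pattern, under simplifying assumptions: it only matches a line that is one complete tag, it does not apply the <li>/<td> exclusion, and the names escapeRegExp and buildEntryPattern are hypothetical rather than taken from the tool.

```javascript
// Escapes characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds a regex matching a single tag whose content is the
// entry text, optionally preceded by a list numbering made of digits and periods.
function buildEntryPattern(entryText) {
  const optionalNumbering = "(?:[\\d.]+\\s+)?";
  return new RegExp(
    `^<(\\w+)[^>]*>\\s*${optionalNumbering}${escapeRegExp(entryText)}\\s*</\\1>$`,
    "i"
  );
}

const pattern = buildEntryPattern("Overview");
console.log(pattern.test("<p>Overview</p>"));            // true
console.log(pattern.test("<p>3.1 Overview</p>"));        // true
console.log(pattern.test("<p>Overview of results</p>")); // false
```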

Entries without list numbering

For entries that do not have list numberings, the level is set to [the level of the last list numbering that did exist] + 1. If there have been no list numberings so far, then the tool uses a level of 2. For entries without a list numbering, the list numbering value itself is set to a blank string, so it will not be included in the entry.

For example, if the table of contents has the following entries:

Then the levels would be 2, 2, 3, 4, 5, 5, and 4.
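
The rule can be checked with a short sketch. The entries below are hypothetical ones chosen so that the rule yields the same sequence of levels; they are not the original example, and assignLevels is an assumed name.

```javascript
// Hypothetical helper: returns the level of each entry, in order.
function assignLevels(entries) {
  let lastNumberedLevel = null; // level of the most recent entry that had a numbering
  return entries.map((entry) => {
    // A numbering is digits separated by periods at the start of the entry.
    const match = entry.match(/^(\d+(?:\.\d+)*)\.?\s+/);
    if (match) {
      lastNumberedLevel = 2 + (match[1].match(/\.\d+/g) || []).length;
      return lastNumberedLevel;
    }
    // No numbering: one level below the last numbered entry, or 2 if there was none.
    return lastNumberedLevel === null ? 2 : lastNumberedLevel + 1;
  });
}

const sampleEntries = [
  "Preface",        // no numbering yet       -> 2
  "1 Introduction", // "1"                    -> 2
  "1.1 Scope",      // "1.1"                  -> 3
  "1.1.1 Audience", // "1.1.1"                -> 4
  "Notes",          // unnumbered, after 4    -> 5
  "Further notes",  // unnumbered, after 4    -> 5
  "1.1.2 Terms"     // "1.1.2"                -> 4
];
console.log(assignLevels(sampleEntries)); // [2, 2, 3, 4, 5, 5, 4]
```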

Manual list numbering

Some documents may not have list numberings, or the list numberings may be misformatted (not formatted with numbers and periods), so the tool wouldn’t be able to indent the table of contents or choose header levels properly. In this case, you can manually add the list numberings into the Dreamweaver document after pasting it from Word.

For example, if you had the following table of contents entries:

and you wanted Definitions to be indented one level more than Introductions, then you could manually add in list numberings yourself:

If the option to remove manual list numbering is checked, then list numberings will be excluded from table of contents entry text, as well as the headers that replace tags/lines in step 3. So after generating the table of contents links and indentation with the manual list numberings, the table of contents entries would still be formatted as so:

Note that the list numberings will still be included as optional parts of the tag/line search regex used in step 3.

This option is ignored if the option to use list numberings isn’t also selected.
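
As a sketch of the effect of this option on entry text (the function names and the removeNumbering flag are hypothetical stand-ins for the tool’s checkbox):

```javascript
// Hypothetical helper: separates a leading list numbering from the rest of the entry.
function splitNumbering(entryText) {
  const match = entryText.match(/^(\d+(?:\.\d+)*)\.?\s+(.*)$/);
  return match
    ? { numbering: match[1], text: match[2] }
    : { numbering: "", text: entryText };
}

// Returns the text to show in the ToC entry and in the replacement header.
function displayText(entryText, removeNumbering) {
  return removeNumbering ? splitNumbering(entryText).text : entryText;
}

console.log(splitNumbering("1.1 Definitions"));    // { numbering: "1.1", text: "Definitions" }
console.log(displayText("1.1 Definitions", true));  // "Definitions"
console.log(displayText("1.1 Definitions", false)); // "1.1 Definitions"
```

The numbering is still available from splitNumbering for choosing the level, even when it is dropped from the displayed text.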

Tips

Implementation details

I have structured format_toc_helpers.js as follows:

ToC entry array formatting

For neatness, I split step 3 of the above implementation details into two helper functions. Only one of the two helper functions is called, depending on how the ToC indentation should be generated.

Both functions return an array with the same structure. Each value in the array is an object with four properties:

  • list numbering, which is extracted from the start of the entry’s content.
  • link id for the link to the header, created as described earlier.
  • indentation level.
  • the content of the entry, passed into clean_entry() to clean it.

These four properties are what is required to create a WET-formatted ToC entry. For example, suppose we have the following ToC entry:

<li><a href="#toc_3.1">3.1 Overview</a></li>

which produces the following header:

<h3 id="toc_3.1">3.1 Overview</h3>

This array of objects containing ToC entry properties is then used for steps 4 and 5.
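
To make the structure concrete, here is a sketch of one such object and how it could be rendered. The property names (listNumbering, linkId, level, content) and the renderEntry function are hypothetical, and the sketch assumes the indentation level doubles as the header level.

```javascript
// Hypothetical shape of one array element for the "3.1 Overview" entry.
const exampleEntry = {
  listNumbering: "3.1", // extracted from the start of the entry
  linkId: "toc_3.1",    // ID shared by the ToC link and the header
  level: 3,             // assumed to serve as both indentation and header level
  content: "Overview"   // cleaned entry text
};

// Hypothetical helper: renders the ToC item and the matching header for one entry.
function renderEntry(entry) {
  // Skip the numbering when it is a blank string.
  const label = [entry.listNumbering, entry.content].filter(Boolean).join(" ");
  return {
    tocItem: `<li><a href="#${entry.linkId}">${label}</a></li>`,
    header: `<h${entry.level} id="${entry.linkId}">${label}</h${entry.level}>`
  };
}

console.log(renderEntry(exampleEntry).tocItem); // <li><a href="#toc_3.1">3.1 Overview</a></li>
console.log(renderEntry(exampleEntry).header);  // <h3 id="toc_3.1">3.1 Overview</h3>
```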