<Inside:WML>

What's in an Empty Word Document?

June 17, 2020

It’s a lot more than nothing

I created an empty document in each of Microsoft Word, LibreOffice Writer and Google Docs and saved it as a docx file. What’s inside?

First, we look at the components of these files. Then we look a bit more closely at a couple of them: the Main Story, and the Styles part. Finally, we create our own minimal, empty docx without the use of a word processor.

The Files

The first thing to note is that a docx file is an Open Packaging Conventions (OPC) container - in this case, a ZIP file. So we can see the contents by running unzip -l on the relevant docx. (Alternatively you could use Expand-Archive at a PowerShell prompt, or rename the file to end with .zip and use the File Explorer, to expand the file into its constituent parts.)
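If you would rather script the listing than use unzip, here is a minimal Python sketch using only the standard library's zipfile module. The list_parts helper is my own name, not part of any tool, and the demo builds a tiny stand-in archive in memory so that the snippet runs without any of the docx files discussed here.

```python
import io
import zipfile


def list_parts(docx):
    """Return (name, uncompressed size) for every entry in a docx package.

    `docx` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]


# Self-contained demo: a stand-in archive built in memory.
# With a real file you would call list_parts("EmptyWord.docx") instead.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<document/>")

for name, size in list_parts(demo):
    print(f"{size:>9}  {name}")
```

Because a docx is just a ZIP file, nothing here is specific to Word: the same function works on any OPC package.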

Let’s look at the Word file first:

Archive: .\EmptyWord.docx
Length Date Time Name
--------- ---------- ----- ----
1312 1980-01-01 00:00 [Content_Types].xml
590 1980-01-01 00:00 _rels/.rels
2658 1980-01-01 00:00 word/document.xml
817 1980-01-01 00:00 word/_rels/document.xml.rels
8393 1980-01-01 00:00 word/theme/theme1.xml
3096 1980-01-01 00:00 word/settings.xml
29135 1980-01-01 00:00 word/styles.xml
803 1980-01-01 00:00 word/webSettings.xml
1567 1980-01-01 00:00 word/fontTable.xml
751 1980-01-01 00:00 docProps/core.xml
709 1980-01-01 00:00 docProps/app.xml
--------- -------
49831 11 files

That’s a whole lot of nothing.

It’s easy to see that each file in the archive is either an xml file or a .rels file. The .rels files and the [Content_Types].xml file are part of the OPC standard and provide a way to specify relationships between files, and the types of content contained within the archive.
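To make the role of [Content_Types].xml a little more concrete, here is a rough Python sketch of how a consumer might look up the content type of a part: an Override for the exact part name wins, otherwise the Default for the file extension applies. The SAMPLE markup is a cut-down illustration rather than the file from any of our three archives, and content_type_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Illustrative, cut-down content types part (not copied from a real archive).
SAMPLE = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_for(part_name, content_types_xml):
    """Return the content type declared for part_name, or None."""
    root = ET.fromstring(content_types_xml)
    # An Override for the exact part name takes precedence over a Default.
    # (Simplification: this compares names exactly, ignoring case rules.)
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_for("/word/document.xml", SAMPLE))
print(content_type_for("/_rels/.rels", SAMPLE))
```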

The other xml files describe the contents of the document using the WordprocessingML flavour of Office Open XML that is used by Microsoft Word to save the contents of a document. (I usually refer to Office Open XML as OOXML, and WordprocessingML as WML.)

What does the LibreOffice Writer archive contain?

Archive: .\EmptyLO.docx
Length Date Time Name
--------- ---------- ----- ----
573 2020-06-13 12:55 _rels/.rels
648 2020-06-13 12:55 docProps/core.xml
540 2020-06-13 12:55 docProps/app.xml
531 2020-06-13 12:55 word/_rels/document.xml.rels
1273 2020-06-13 12:55 word/document.xml
2333 2020-06-13 12:55 word/styles.xml
853 2020-06-13 12:55 word/fontTable.xml
208 2020-06-13 12:55 word/settings.xml
1374 2020-06-13 12:55 [Content_Types].xml
--------- -------
8333 9 files

Not dissimilar. But not precisely the same. There’s no word/webSettings.xml, and no word/theme/theme1.xml. (We also see that the LO archive has provided dates for the component files, whereas Word doesn’t, which is why the default date of 1 January 1980 shows up in the first listing.) The sizes are all over the place, although generally smaller.

What about Google Docs?

Archive: .\EmptyGD.docx
Length Date Time Name
--------- ---------- ----- ----
1341 2020-06-13 06:12 word/numbering.xml
1770 2020-06-13 06:12 word/settings.xml
1370 2020-06-13 06:12 word/fontTable.xml
4575 2020-06-13 06:12 word/styles.xml
1814 2020-06-13 06:12 word/document.xml
812 2020-06-13 06:12 word/_rels/document.xml.rels
298 2020-06-13 06:12 _rels/.rels
7643 2020-06-13 06:12 word/theme/theme1.xml
1069 2020-06-13 06:12 [Content_Types].xml
--------- -------
20692 9 files

Again there is no word/webSettings.xml. This time we do have a word/theme/theme1.xml, plus a word/numbering.xml, but neither a docProps/core.xml nor a docProps/app.xml. Docs also provides dates, and the sizes are generally chunkier than Writer’s.
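The overlap between the three listings is easier to see with a few set operations. This is just a Python sketch over the part names copied from the listings above; the variable names are mine.

```python
# Part names copied from the three listings above.
word = {
    "[Content_Types].xml", "_rels/.rels", "word/document.xml",
    "word/_rels/document.xml.rels", "word/theme/theme1.xml",
    "word/settings.xml", "word/styles.xml", "word/webSettings.xml",
    "word/fontTable.xml", "docProps/core.xml", "docProps/app.xml",
}
writer = {
    "_rels/.rels", "docProps/core.xml", "docProps/app.xml",
    "word/_rels/document.xml.rels", "word/document.xml", "word/styles.xml",
    "word/fontTable.xml", "word/settings.xml", "[Content_Types].xml",
}
docs = {
    "word/numbering.xml", "word/settings.xml", "word/fontTable.xml",
    "word/styles.xml", "word/document.xml", "word/_rels/document.xml.rels",
    "_rels/.rels", "word/theme/theme1.xml", "[Content_Types].xml",
}

common = word & writer & docs        # parts every application writes
only_word = word - writer - docs     # parts unique to Word
only_writer = writer - word - docs   # parts unique to Writer
only_docs = docs - word - writer     # parts unique to Docs

print("common:", sorted(common))
print("only Word:", sorted(only_word))
print("only Writer:", sorted(only_writer))
print("only Docs:", sorted(only_docs))
```

Seven parts are common to all three packages; Word alone writes word/webSettings.xml and Docs alone writes word/numbering.xml.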

The Main Story

The easiest way I know of to quickly explore the contents of the various files (which we might as well start referring to as ‘parts’) is using the OOXML Tools Chrome browser extension - simply drag and drop the docx file into the OOXML Tools tab and click away.

The word/document.xml part contains the ‘Main Story’ of the document - the text that doesn’t form part of a header, footer, endnote or footnote (which are stored in separate parts, none of which we have in this case).

Our Main Story is empty. So what does empty look like? Ignoring the XML namespace declarations, we find the empty paragraph (there is always a paragraph) looks like this:

  • Word:
<w:p w14:paraId="0FF30EA2" w14:textId="77777777" w:rsidR="00361EE8"
w:rsidRDefault="00C76DB8"/>
  • LO Writer
<w:p>
<w:pPr>
<w:pStyle w:val="Normal"/>
<w:bidi w:val="0"/>
<w:jc w:val="left"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r>
<w:rPr></w:rPr>
</w:r>
</w:p>
  • GoogleDocs
<w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000"
w:rsidRDefault="00000000" w:rsidRPr="00000000" w14:paraId="00000001">
<w:pPr>
<w:rPr/>
</w:pPr>
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
<w:rPr>
<w:rtl w:val="0"/>
</w:rPr>
</w:r>
</w:p>

The Word paragraph is the simplest: an empty w:p element (w being the conventional prefix for the main WML namespace, and p standing for … ‘paragraph’) with some ‘Id’ or ‘id’ attributes. These are used to keep track of revisions and editing sessions, and to assist more recent versions of Office (hence the w14 namespace) to keep track of text when it’s been cut and pasted within or between applications.

Wait - revisions? But nothing’s been revised. What are these revisions that are being tracked? Search me.

The LibreOffice Writer paragraph shows a bit more of the standard WML structure. It contains a Paragraph Properties element (w:pPr) that tells us the empty paragraph has the Normal style, is not right-to-left (w:bidi set to 0), and is left-aligned. (I’ll get to the last element shortly.) The paragraph contains a single Run (contiguous text with the same format), w:r, with an empty Run Properties element (w:rPr) and no other content.

What are the Run Properties in the Paragraph Properties (the last element) referring to, then? The styling of the paragraph marker at the end of the paragraph. Really.

Finally, the GoogleDocs paragraph is a bit of a mixture of both the Word and the Writer paragraphs, in that it provides more structure (Paragraph Properties and a Run with Run Properties) and a sprinkling of Id attributes. In this case the Right To Left (w:rtl) property of the Run is used to tell us the non-existent text runs left to right.

In addition to the (empty) paragraph, each of the Main Stories contains a Section Properties (w:sectPr) element. This specifies certain attributes of the last, and only, section of the document: page size, margins, page numbering style, columns, text direction… Each of our three applications chooses a different subset of things to specify, however.

We can conclude, as regards the Main Story, that there are lots of partially overlapping ways of representing an empty document, and that they all seem to work. We’ll take advantage of this later.
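One way to convince yourself that these different representations really are equivalent is to read them programmatically. The Python sketch below pulls the text of each paragraph out of a Main Story using the standard library's ElementTree. DOC is the Writer paragraph from above wrapped in the w:document and w:body elements, and paragraph_texts is a hypothetical helper of my own.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# The Writer paragraph shown above, wrapped in document and body elements.
DOC = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:bidi w:val="0"/>
        <w:jc w:val="left"/>
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
      </w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_texts(document_xml):
    """Return the text of each top-level paragraph in a Main Story."""
    root = ET.fromstring(document_xml)
    texts = []
    for paragraph in root.iterfind("w:body/w:p", NS):
        # Text lives in w:t elements inside runs; an empty paragraph has none.
        pieces = (t.text or "" for t in paragraph.iterfind(".//w:t", NS))
        texts.append("".join(pieces))
    return texts


print(paragraph_texts(DOC))
```

Fed Word's bare empty w:p element instead, the function gives the same answer: one paragraph, no text.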

The Styles Part

The biggest difference in size between our files seems to come down to the word/styles.xml part. For Writer it’s 2,333 bytes, for Docs 4,575 bytes and for Word itself a whopping 29,135 bytes - bigger than the whole of the Writer and Docs files (uncompressed) combined!

Each of the Styles Parts begins with a w:docDefaults element which specifies, as you would expect, some default values for Paragraph and Run Properties. To take LibreOffice Writer as an example, we have:

<w:docDefaults>
<w:rPrDefault>
<w:rPr>
<w:rFonts w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"
w:eastAsia="NSimSun" w:cs="Lucida Sans"/>
<w:kern w:val="2"/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:lang w:val="fr-CH" w:eastAsia="zh-CN" w:bidi="hi-IN"/>
</w:rPr>
</w:rPrDefault>
<w:pPrDefault>
<w:pPr>
<w:widowControl/>
</w:pPr>
</w:pPrDefault>
</w:docDefaults>

Here we see, for example, that Liberation Serif is the default font for Western script (w:ascii), but NSimSun (a Simplified Chinese font featuring mincho (serif) stroke style) for Asian/CJK (w:eastAsia), and that the default language is Swiss French (“fr-CH”) because I happened to save the file in Geneva.

The GoogleDocs version is simpler:

<w:docDefaults>
<w:rPrDefault>
<w:rPr>
<w:rFonts w:ascii="Arial" w:cs="Arial" w:eastAsia="Arial" w:hAnsi="Arial"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en"/>
</w:rPr>
</w:rPrDefault>
<w:pPrDefault>
<w:pPr>
<w:spacing w:line="276" w:lineRule="auto"/>
</w:pPr>
</w:pPrDefault>
</w:docDefaults>

From Google you get Arial, in 11pt (the Size w:sz and Complex Script Size w:szCs are given in half points) and English (“en”).
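The half-point arithmetic is easy to check in code. Here is a small Python sketch that reads w:sz from the Docs docDefaults shown above and converts it to points; default_font_size_pt is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# The Google Docs docDefaults from above, with the namespace declared.
DOC_DEFAULTS = f"""<w:docDefaults xmlns:w="{W}">
  <w:rPrDefault>
    <w:rPr>
      <w:rFonts w:ascii="Arial" w:cs="Arial" w:eastAsia="Arial" w:hAnsi="Arial"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
      <w:lang w:val="en"/>
    </w:rPr>
  </w:rPrDefault>
</w:docDefaults>"""


def default_font_size_pt(xml_text):
    """Return the default run font size in points, or None if not set."""
    root = ET.fromstring(xml_text)
    sz = root.find(".//w:rPrDefault/w:rPr/w:sz", NS)
    if sz is None:
        return None
    # w:sz is expressed in half points, so halve it to get points.
    return int(sz.get(f"{{{W}}}val")) / 2


print(default_font_size_pt(DOC_DEFAULTS))
```

The same function applied to Writer's docDefaults (w:sz of 24) would give 12 points.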

Note that both Docs and Writer specify the font and language defaults at the Run level (w:rPrDefault). It is the Paragraph-level defaults that differ: Docs sets the line spacing, where Writer only sets widow control.

For completeness, Microsoft Word is a mix of the other two:

<w:docDefaults>
<w:rPrDefault>
<w:rPr>
<w:rFonts w:asciiTheme="minorHAnsi" w:eastAsiaTheme="minorHAnsi"
w:hAnsiTheme="minorHAnsi" w:cstheme="minorBidi"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="fr-CH" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
</w:rPrDefault>
<w:pPrDefault>
<w:pPr>
<w:spacing w:after="160" w:line="259" w:lineRule="auto"/>
</w:pPr>
</w:pPrDefault>
</w:docDefaults>

Following the docDefaults, the Styles Parts define a handful of styles (w:style elements). Writer defines Normal, Heading, TextBody, List, Caption and Index. Docs defines Normal, TableNormal, Heading1 to Heading6, Title and Subtitle. Word defines Normal, DefaultParagraphFont, TableNormal and NoList. (Once again, lots of variation and limited overlap.)

Let’s have a look at what Normal means.

  • Microsoft Word
<w:style w:type="paragraph" w:default="1" w:styleId="Normal">
<w:name w:val="Normal"/>
<w:qFormat/>
</w:style>

Ummm, OK. It’s the default paragraph style, it’s called “Normal” and it shows up in the style gallery in the UI (qFormat).

  • LibreOffice Writer
<w:style w:type="paragraph" w:styleId="Normal">
<w:name w:val="Normal"/>
<w:qFormat/>
<w:pPr>
<w:widowControl/>
<w:bidi w:val="0"/>
</w:pPr>
<w:rPr>
<w:rFonts w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"
w:eastAsia="NSimSun" w:cs="Lucida Sans"/>
<w:color w:val="auto"/>
<w:kern w:val="2"/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:lang w:val="fr-CH" w:eastAsia="zh-CN" w:bidi="hi-IN"/>
</w:rPr>
</w:style>

This at least has some content, although most of it is repeated from Writer’s docDefaults (and we have 12pt here, where Word and Docs default to 11).

  • GoogleDocs
<w:style w:type="paragraph" w:styleId="Normal" w:default="1">
<w:name w:val="normal"/>
</w:style>

Docs doesn’t even seem to be trying. And why the lowercase “normal”?

None of this explains why the Styles Part for Microsoft Word is so much bigger than the others. The answer is that Word includes almost 400 Latent Style entries - styles that are not used in the document, but that are known to the application producing it. These essentially contain information about the UI rather than the document content. A representative example reads:

<w:lsdException w:name="Grid Table 7 Colorful Accent 6" w:uiPriority="52"/>

Whether it is a good use of bits at rest or in motion to specify this sort of thing in every docx file is open to debate. In any event, it is the nearly 400 entries of this kind in the Styles Part of the Microsoft Word docx file that explain its larger size.
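If you want to check the count for yourself, a few lines of Python will do it. This sketch counts w:lsdException entries against real w:style definitions. The SAMPLE markup is a tiny stand-in and not Word's actual Styles Part: the second lsdException is a made-up entry for illustration, and count_styles is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Tiny stand-in for a Styles Part; the second lsdException is invented.
SAMPLE = f"""<w:styles xmlns:w="{W}">
  <w:latentStyles>
    <w:lsdException w:name="Grid Table 7 Colorful Accent 6" w:uiPriority="52"/>
    <w:lsdException w:name="Hypothetical Latent Style" w:uiPriority="99"/>
  </w:latentStyles>
  <w:style w:type="paragraph" w:default="1" w:styleId="Normal">
    <w:name w:val="Normal"/>
    <w:qFormat/>
  </w:style>
</w:styles>"""


def count_styles(styles_xml):
    """Return (latent style entries, defined styles) for a Styles Part."""
    root = ET.fromstring(styles_xml)
    latent = root.findall("w:latentStyles/w:lsdException", NS)
    defined = root.findall("w:style", NS)
    return len(latent), len(defined)


print(count_styles(SAMPLE))
```

To run it against a real file, read word/styles.xml out of the docx with zipfile and pass the bytes to count_styles.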

I Know What You Did Last Edit

Let’s turn our attention to the docProps parts. (Which GoogleDocs doesn’t have.)

There are two parts to look at, core and app. Unsurprisingly the app parts give us details about the application environment that created the document. See if you can guess which is which:

<Application>LibreOffice/6.3.5.2$Windows_X86_64 LibreOffice_project/
dd0751754f11728f69b42ee2af66670068624673</Application>

versus

<Application>Microsoft Office Word</Application>
<AppVersion>16.0000</AppVersion>

The core part contains metadata about the document, some elements conforming to the Dublin Core Metadata Initiative, which explains the dc and dcterms namespaces for those items. Other elements conform to the OPC standard and use the cp (for ‘Core Properties’) namespace.

In particular, the core part records the author(s) of the document as dc:creator and cp:lastModifiedBy:

  • Word
<dc:creator>David Murray</dc:creator>
<cp:lastModifiedBy>David Murray</cp:lastModifiedBy>
  • LibreOffice
<dc:creator></dc:creator>
<cp:lastModifiedBy></cp:lastModifiedBy>

LibreOffice doesn’t record the information in the empty docx file (although it does record the language as Swiss French).
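Reading these two fields programmatically is straightforward. The Python sketch below parses a core properties part and returns both values. CORE wraps the two Word elements shown above in a cp:coreProperties root (a real core.xml holds more elements), and the authors function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The two Word elements from above, wrapped in a core properties root.
CORE = f"""<cp:coreProperties xmlns:cp="{NS['cp']}" xmlns:dc="{NS['dc']}">
  <dc:creator>David Murray</dc:creator>
  <cp:lastModifiedBy>David Murray</cp:lastModifiedBy>
</cp:coreProperties>"""


def authors(core_xml):
    """Return the creator and last-modified-by values ('' if empty or absent)."""
    root = ET.fromstring(core_xml)
    return {
        "creator": root.findtext("dc:creator", default="", namespaces=NS),
        "lastModifiedBy": root.findtext("cp:lastModifiedBy", default="", namespaces=NS),
    }


print(authors(CORE))
```

For the LibreOffice file, with its empty elements, both values come back as empty strings.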

If we don’t want Word leaking this information we can use the Info blade of the backstage tab … I don’t really understand the UI lingo here. You select the File entry on the top bar of Word, then the Info entry in the left-hand side bar, then the Inspect Document entry on the Inspect Document tile dropdown, and you will discover that, indeed, the Author is recorded.

This information can be removed:

Document Properties

Pressing ‘Remove All’ and saving the subsequent file does give us empty dc:creator and cp:lastModifiedBy entries. (We may explore a way of discovering - and removing - this information in bulk in another post.)

A Minimal Empty Document

Given what we know about OPC, OOXML and what works in Word, can we create a minimal empty docx file?

Here’s one approach. We need three files:

  • [Content_Types].xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types
xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/xml" Extension="xml"/>
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" PartName="/document.xml"/>
</Types>

Here we describe the content contained by the other two files: package relationships (for rels files - or file, in our case) and a Main WordprocessingML (WML) Story.

  • _rels/.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/document.xml"/>
</Relationships>

Here we describe the relationship between the package (because the relationship is contained in the package-level rels file) and the one remaining part: the ‘officeDocument’.

  • document.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p />
</w:body>
</w:document>

And finally we describe our empty Main Story - a single empty paragraph.

You will note that we have put our Main Story part at ‘/document.xml’ where the word processors all put it at ‘/word/document.xml’. This shouldn’t matter, since the whole point of the rels parts indirection is to provide this flexibility, although it is possible that some programs that purport to deal with docx files will choke. As we will see, neither Word nor Writer has any problems.

One way to get these parts into an appropriate package is to use PowerShell:

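# Paste the XML for each of the three parts between the quotes
$ct = '<the [Content_Types].xml content goes here>'
$rels = '<the _rels/.rels content goes here>'
$doc = '<the document.xml content goes here>'
# Work in a scratch 'Minimal' directory under the temp folder
cd $env:temp
New-Item -Path "Minimal" -ItemType "Directory"
cd Minimal
# Create the _rels folder and the three parts
New-Item -Path "_rels" -ItemType "Directory"
New-Item -Path '[Content_Types].xml' -ItemType "File" -Value $ct
New-Item -Path '_rels/.rels' -ItemType "File" -Value $rels
New-Item -Path 'document.xml' -ItemType "File" -Value $doc
# Zip the parts into a package and open it with the default application
Compress-Archive -Path *.xml,_rels/ -DestinationPath Minimal.docx
& .\Minimal.docx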

The above code moves us into a temporary directory where we create a new ‘Minimal’ directory (to make cleanup easier), in which we create new files with the contents of our three parts. We then compress these into a new Zip archive called Minimal.docx, which we open with the default application (the & .\Minimal.docx line). Assuming this is Microsoft Word, we find our wonderful, empty and working docx file.
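If PowerShell isn't to hand, the same package can be assembled with Python's zipfile module. This is an alternative sketch, not the method used above. The three constants hold the part contents exactly as given earlier, build_minimal_docx is a hypothetical helper name, and the demo writes to an in-memory buffer rather than to disk.

```python
import io
import zipfile

CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/xml" Extension="xml"/>
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Override ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" PartName="/document.xml"/>
</Types>"""

RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/document.xml"/>
</Relationships>"""

DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p />
</w:body>
</w:document>"""


def build_minimal_docx(target):
    """Write the three-part minimal package to a path or file-like object."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("document.xml", DOCUMENT)


# Demo: build the package in memory and list what went into it.
buffer = io.BytesIO()
build_minimal_docx(buffer)
print(zipfile.ZipFile(buffer).namelist())
```

To get a file you can open, call build_minimal_docx("Minimal.docx") instead. I have not tested the result against every word processor, so treat it as a starting point.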

Phew!

Conclusion

Microsoft Word docx files turn out to be complex beasts, but not so complex that we can’t build them ourselves. Obviously typing in the xml on the command line isn’t a solution that’s going to scale, but it shows that, with care, we can manipulate files and still hope to edit them in our favourite word processor (or send them to others to do so).

Otherwise, we’ve dipped our toes into what makes up a docx file. We’ve seen that style information is (or can be) kept separate from the Main Story, and we’ve seen where (lots of) other parts can be added. We’ll look at some of this in other posts.

Finally, we’ve seen (or if you haven’t played along, you can take my word for it) that OOXML Tools can be a wonderful way of seeing what’s inside those docx files - empty or not.




David Murray is an old^W experienced in-house lawyer (and amateur smug lisp weenie) who likes to explore personal-scale legaltech. You could follow him on Twitter

© 2020 <Inside:WML>