June 17, 2020
I created an empty document in each of Microsoft Word, LibreOffice Writer
and Google Docs and saved it as a docx file. What’s inside?
First, we look at the components of these files. Then we look a bit more
closely at a couple of them: the Main Story, and the Styles part. Finally,
we create our own minimal, empty docx without the use of a word processor.
The first thing to note is that a docx file is an Open Packaging
Conventions (OPC) container - in this case, a ZIP file. So we can see the
contents by running unzip -l on the relevant docx. (Alternatively you could
use Expand-Archive at a PowerShell prompt, or rename the file to end with
.zip and use the File Explorer, to expand the file into its constituent
parts.)
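If neither unzip nor PowerShell is to hand, the same listing can be produced in a few lines of Python. This is only a sketch and not part of the original workflow: the list_parts helper is a name invented here, and the two-entry sample archive is a stand-in so the code runs without a real docx on disk.

```python
import io
import zipfile


def list_parts(docx):
    """Return (name, uncompressed size) for every entry in a docx package.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]


# Stand-in package (invented for illustration) so the sketch is self-contained.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<document/>")
sample.seek(0)

for name, size in list_parts(sample):
    print(f"{size:>9}  {name}")
```

Passing a path such as "EmptyWord.docx" instead of the in-memory sample should list a real file in the same way.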
Let’s look at the Word file first:
Archive:  .\EmptyWord.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1312  1980-01-01 00:00   [Content_Types].xml
      590  1980-01-01 00:00   _rels/.rels
     2658  1980-01-01 00:00   word/document.xml
      817  1980-01-01 00:00   word/_rels/document.xml.rels
     8393  1980-01-01 00:00   word/theme/theme1.xml
     3096  1980-01-01 00:00   word/settings.xml
    29135  1980-01-01 00:00   word/styles.xml
      803  1980-01-01 00:00   word/webSettings.xml
     1567  1980-01-01 00:00   word/fontTable.xml
      751  1980-01-01 00:00   docProps/core.xml
      709  1980-01-01 00:00   docProps/app.xml
---------                     -------
    49831                     11 files
That’s a whole lot of nothing.
It’s easy to see that each file in the archive is either an xml file or
a .rels file. The .rels files and the [Content_Types].xml file are part of
the OPC standard and provide a way to specify relationships between files,
and the types of content contained within the archive.
The other xml files describe the contents of the document using the
WordprocessingML flavour of Office Open XML that is used by Microsoft Word
to save the contents of a document. (I usually refer to Office Open XML as
OOXML, and WordprocessingML as WML.)
What does the LibreOffice Writer archive contain?
Archive:  .\EmptyLO.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
      573  2020-06-13 12:55   _rels/.rels
      648  2020-06-13 12:55   docProps/core.xml
      540  2020-06-13 12:55   docProps/app.xml
      531  2020-06-13 12:55   word/_rels/document.xml.rels
     1273  2020-06-13 12:55   word/document.xml
     2333  2020-06-13 12:55   word/styles.xml
      853  2020-06-13 12:55   word/fontTable.xml
      208  2020-06-13 12:55   word/settings.xml
     1374  2020-06-13 12:55   [Content_Types].xml
---------                     -------
     8333                     9 files
Not dissimilar. But not precisely the same. There’s no word/webSettings.xml,
and no word/theme/theme1.xml. (We also see that the LO archive has provided
dates for the component files, whereas Word doesn’t, which is why the default
date of 1 January 1980 shows up in the first listing.) The sizes are all over
the place, although generally smaller.
What about Google Docs?
Archive:  .\EmptyGD.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1341  2020-06-13 06:12   word/numbering.xml
     1770  2020-06-13 06:12   word/settings.xml
     1370  2020-06-13 06:12   word/fontTable.xml
     4575  2020-06-13 06:12   word/styles.xml
     1814  2020-06-13 06:12   word/document.xml
      812  2020-06-13 06:12   word/_rels/document.xml.rels
      298  2020-06-13 06:12   _rels/.rels
     7643  2020-06-13 06:12   word/theme/theme1.xml
     1069  2020-06-13 06:12   [Content_Types].xml
---------                     -------
    20692                     9 files
Again no word/webSettings.xml. This time we do have a word/theme/theme1.xml
and a word/numbering.xml, but neither a docProps/core.xml nor a
docProps/app.xml. Docs also provides dates, and the sizes are generally
chunkier than Writer’s.
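The comparison between the three listings can be double-checked with a little set arithmetic. The sketch below (Python is my choice here, not something used elsewhere in this post) simply copies the part names from the three listings above and reports what is shared and what is unique to each application.

```python
# Part names copied from the three archive listings.
word = {
    "[Content_Types].xml", "_rels/.rels", "word/document.xml",
    "word/_rels/document.xml.rels", "word/theme/theme1.xml",
    "word/settings.xml", "word/styles.xml", "word/webSettings.xml",
    "word/fontTable.xml", "docProps/core.xml", "docProps/app.xml",
}
writer = {
    "_rels/.rels", "docProps/core.xml", "docProps/app.xml",
    "word/_rels/document.xml.rels", "word/document.xml", "word/styles.xml",
    "word/fontTable.xml", "word/settings.xml", "[Content_Types].xml",
}
docs = {
    "word/numbering.xml", "word/settings.xml", "word/fontTable.xml",
    "word/styles.xml", "word/document.xml", "word/_rels/document.xml.rels",
    "_rels/.rels", "word/theme/theme1.xml", "[Content_Types].xml",
}

common = word & writer & docs        # parts every application writes
only_word = word - writer - docs     # parts only Word writes
only_writer = writer - word - docs   # parts only Writer writes
only_docs = docs - word - writer     # parts only Docs writes

print("common:", sorted(common))
print("only Word:", sorted(only_word))
print("only Writer:", sorted(only_writer))
print("only Docs:", sorted(only_docs))
```

Seven parts are common to all three, and only word/webSettings.xml (Word) and word/numbering.xml (Docs) are unique to a single application.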
The easiest way I know of to quickly explore the contents of the various
files (which we might as well start referring to as ‘parts’) is using
the OOXML Tools Chrome browser extension - simply drag and drop the docx
file into the OOXML Tools tab and click away.
The word/document.xml part contains the ‘Main Story’ of the document - the
text that doesn’t form part of a header, footer, endnote or footnote (which
are stored in separate parts, none of which we have in this case).
Our Main Story is empty. So what does empty look like? Ignoring the XML namespace declarations, we find the empty paragraph (there is always a paragraph) looks like this:
In Word:
<w:p w14:paraId="0FF30EA2" w14:textId="77777777" w:rsidR="00361EE8"
     w:rsidRDefault="00C76DB8"/>
In LibreOffice Writer:
<w:p>
  <w:pPr>
    <w:pStyle w:val="Normal"/>
    <w:bidi w:val="0"/>
    <w:jc w:val="left"/>
    <w:rPr></w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr></w:rPr>
  </w:r>
</w:p>
In Google Docs:
<w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000"
     w:rsidRDefault="00000000" w:rsidRPr="00000000" w14:paraId="00000001">
  <w:pPr>
    <w:rPr/>
  </w:pPr>
  <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
    <w:rPr>
      <w:rtl w:val="0"/>
    </w:rPr>
  </w:r>
</w:p>
The Word paragraph is the simplest: an empty w:p element (w being the usual
prefix for the main WML namespace, and p standing for … ‘paragraph’) with
some ‘Id’ or ‘id’ attributes. These are used to keep track of revisions and
editing sessions, and to assist more recent versions of Office (hence the
w14 namespace) to keep track of text when it’s been cut and pasted within or
between applications.
Wait - revisions? But nothing’s been revised. What are these revisions that are being tracked? Search me.
The LibreOffice Writer paragraph shows a bit more of the standard WML
structure. It contains a Paragraph Properties element (w:pPr) that tells us
the empty paragraph has the Normal style, the standard bidirectional status,
and is justified left. (I’ll get to the last element shortly.) The paragraph
contains a single Run (contiguous text with the same format), w:r, with an
empty Run Properties element (w:rPr), and no other content.
What are the Run Properties in the Paragraph Properties (the last element) referring to, then? The styling of the paragraph marker at the end of the paragraph. Really.
Finally, the Google Docs paragraph is a bit of a mixture of both the Word
and the Writer paragraphs, in that it provides more structure (Paragraph
Properties and a Run with Run Properties) and a sprinkling of Id attributes.
In this case the Right To Left (w:rtl) property of the Run is used to tell
us the non-existent text runs left to right.
In addition to the (empty) paragraph, each of the Main Stories contains
a Section Properties (w:sectPr) element. This specifies certain attributes
of the last, and only, section of the document: page size, margins, page
numbering style, columns, text direction… Each of our three applications
chooses a different subset of things to specify, however.
We can conclude, as regards the Main Story, that there are lots of partially overlapping ways of representing an empty document, and that they all seem to work. We’ll take advantage of this later.
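To make "empty" a little more concrete, here is a rough Python sketch that parses the three paragraphs quoted above and reports their child elements and whether any of them contains a text element (w:t). The describe helper and the wrapping root element are inventions for this example; the two namespace URIs are the standard ones for the w and w14 prefixes.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"

# The three empty paragraphs, copied from the snippets above.
PARAGRAPHS = {
    "Word": (
        '<w:p w14:paraId="0FF30EA2" w14:textId="77777777" '
        'w:rsidR="00361EE8" w:rsidRDefault="00C76DB8"/>'
    ),
    "Writer": (
        '<w:p><w:pPr><w:pStyle w:val="Normal"/><w:bidi w:val="0"/>'
        '<w:jc w:val="left"/><w:rPr></w:rPr></w:pPr>'
        '<w:r><w:rPr></w:rPr></w:r></w:p>'
    ),
    "Docs": (
        '<w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000" '
        'w:rsidRDefault="00000000" w:rsidRPr="00000000" w14:paraId="00000001">'
        '<w:pPr><w:rPr/></w:pPr>'
        '<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">'
        '<w:rPr><w:rtl w:val="0"/></w:rPr></w:r></w:p>'
    ),
}


def describe(paragraph_xml):
    """Return (local names of the paragraph's children, has-text flag)."""
    # Wrap the fragment so the w and w14 prefixes are declared.
    wrapped = f'<root xmlns:w="{W}" xmlns:w14="{W14}">{paragraph_xml}</root>'
    paragraph = ET.fromstring(wrapped)[0]
    children = [child.tag.split("}")[1] for child in paragraph]
    has_text = paragraph.find(f".//{{{W}}}t") is not None
    return children, has_text


for app, xml_text in PARAGRAPHS.items():
    print(app, describe(xml_text))
```

All three report no w:t element: the structures differ, but none of them holds any text.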
The biggest difference in size between our files seems to come down to
the word/styles.xml part. For Writer it’s 2,333 bytes, for Docs 4,575 bytes
and for Word itself a whopping 29,135 bytes - bigger than the whole of the
Writer and Docs files (uncompressed) combined!
Each of the Styles Parts begins with a w:docDefaults element which specifies,
as you would expect, some default values for Paragraph and Run Properties. To
take LibreOffice Writer as an example, we have:
<w:docDefaults>
  <w:rPrDefault>
    <w:rPr>
      <w:rFonts w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"
                w:eastAsia="NSimSun" w:cs="Lucida Sans"/>
      <w:kern w:val="2"/>
      <w:sz w:val="24"/>
      <w:szCs w:val="24"/>
      <w:lang w:val="fr-CH" w:eastAsia="zh-CN" w:bidi="hi-IN"/>
    </w:rPr>
  </w:rPrDefault>
  <w:pPrDefault>
    <w:pPr>
      <w:widowControl/>
    </w:pPr>
  </w:pPrDefault>
</w:docDefaults>
Here we see, for example, that Liberation Serif is the default font
for Western script (w:ascii), but NSimSun (a Simplified Chinese font
featuring mincho (serif) stroke style) for Asian/CJK (w:eastAsia),
and that the default language is Swiss French (“fr-CH”) because I
happened to save the file in Geneva.
The Google Docs version is simpler:
<w:docDefaults>
  <w:rPrDefault>
    <w:rPr>
      <w:rFonts w:ascii="Arial" w:cs="Arial" w:eastAsia="Arial" w:hAnsi="Arial"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
      <w:lang w:val="en"/>
    </w:rPr>
  </w:rPrDefault>
  <w:pPrDefault>
    <w:pPr>
      <w:spacing w:line="276" w:lineRule="auto"/>
    </w:pPr>
  </w:pPrDefault>
</w:docDefaults>
From Google you get Arial, in 11pt (the Size w:sz and Complex Script Size
w:szCs are given in half points) and English (“en”).
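Interestingly, Docs and Writer both specify their font and language defaults at the Run level (w:rPrDefault); where they differ is at the Paragraph level (w:pPrDefault), where Docs sets line spacing and Writer sets widow control.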
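Since w:sz is expressed in half points, reading the default font size means halving the stored value. A small Python sketch of that arithmetic follows; default_font_size_pt is a helper name made up for this example, and the sample is a trimmed copy of the Docs defaults with the namespace declaration added so it parses on its own.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed copy of the Docs w:docDefaults element shown above.
DOCS_DEFAULTS = (
    f'<w:docDefaults xmlns:w="{W}">'
    '<w:rPrDefault><w:rPr>'
    '<w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en"/>'
    '</w:rPr></w:rPrDefault>'
    '</w:docDefaults>'
)


def default_font_size_pt(doc_defaults_xml):
    """Return the default run font size in points, or None if not specified."""
    root = ET.fromstring(doc_defaults_xml)
    sz = root.find(f".//{{{W}}}rPrDefault/{{{W}}}rPr/{{{W}}}sz")
    if sz is None:
        return None
    # w:sz counts half points, so 22 means 11pt.
    return int(sz.get(f"{{{W}}}val")) / 2


print(default_font_size_pt(DOCS_DEFAULTS))
```

The same function applied to Writer's defaults (w:sz of 24) would give 12 points.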
For completeness, Microsoft Word is a mix of the other two:
<w:docDefaults>
  <w:rPrDefault>
    <w:rPr>
      <w:rFonts w:asciiTheme="minorHAnsi" w:eastAsiaTheme="minorHAnsi"
                w:hAnsiTheme="minorHAnsi" w:cstheme="minorBidi"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
      <w:lang w:val="fr-CH" w:eastAsia="en-US" w:bidi="ar-SA"/>
    </w:rPr>
  </w:rPrDefault>
  <w:pPrDefault>
    <w:pPr>
      <w:spacing w:after="160" w:line="259" w:lineRule="auto"/>
    </w:pPr>
  </w:pPrDefault>
</w:docDefaults>
Following the docDefaults the Styles Parts define a handful of w:styles.
Writer defines Normal, Heading, TextBody, List, Caption and Index. Docs
defines Normal, TableNormal, Heading1 to Heading6, Title and Subtitle.
Word defines Normal, DefaultParagraphFont, TableNormal and NoList. (Once
again, lots of variation and limited overlap.)
Let’s have a look at what Normal means. In Word:
<w:style w:type="paragraph" w:default="1" w:styleId="Normal">
  <w:name w:val="Normal"/>
  <w:qFormat/>
</w:style>
Ummm, OK. It’s the default paragraph style, it’s called “Normal” and it
shows up in the UI (qFormat). In Writer:
<w:style w:type="paragraph" w:styleId="Normal">
  <w:name w:val="Normal"/>
  <w:qFormat/>
  <w:pPr>
    <w:widowControl/>
    <w:bidi w:val="0"/>
  </w:pPr>
  <w:rPr>
    <w:rFonts w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"
              w:eastAsia="NSimSun" w:cs="Lucida Sans"/>
    <w:color w:val="auto"/>
    <w:kern w:val="2"/>
    <w:sz w:val="24"/>
    <w:szCs w:val="24"/>
    <w:lang w:val="fr-CH" w:eastAsia="zh-CN" w:bidi="hi-IN"/>
  </w:rPr>
</w:style>
This at least has some content, although most of it seems to be repeated
from the docDefaults (and we have 12pt rather than the 11pt that Word and
Docs default to). And in Docs:
<w:style w:type="paragraph" w:styleId="Normal" w:default="1"><w:name w:val="normal"/></w:style>
Docs doesn’t even seem to be trying. And why the lowercase “normal”?
None of this explains why the Styles Part for Microsoft Word is so much bigger than the others. The answer is that Word includes almost 400 Latent Style entries - styles that are not used in the document, but that are known to the application producing it. These essentially contain information about the UI rather than the document content. A representative example reads:
<w:lsdException w:name="Grid Table 7 Colorful Accent 6" w:uiPriority="52"/>
Whether it is a good use of bits at rest or in motion to specify this sort
of thing in every docx file is open to debate. In any event, it is the
nearly 400 entries of this kind in the Styles Part of the Microsoft Word
docx file that explain its larger size.
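Counting the Latent Style entries in a Styles Part is a one-liner once the XML is parsed. The Python sketch below assumes the entries sit inside a w:latentStyles element, as they do in Word's output; count_latent_styles is a name invented here, and the two-entry sample is illustrative rather than a copy of a real styles.xml.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative stand-in for word/styles.xml (a real Word file has ~400 entries).
STYLES = (
    f'<w:styles xmlns:w="{W}">'
    '<w:latentStyles>'
    '<w:lsdException w:name="Normal" w:uiPriority="0" w:qFormat="1"/>'
    '<w:lsdException w:name="Grid Table 7 Colorful Accent 6" w:uiPriority="52"/>'
    '</w:latentStyles>'
    '</w:styles>'
)


def count_latent_styles(styles_xml):
    """Count w:lsdException entries under w:latentStyles."""
    root = ET.fromstring(styles_xml)
    return len(root.findall(f".//{{{W}}}latentStyles/{{{W}}}lsdException"))


print(count_latent_styles(STYLES))
```

Feeding it the bytes of word/styles.xml read from a Word-produced docx should give the real figure for that file.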
Let’s turn our attention to the docProps parts. (Which Google Docs doesn’t
have.)
There are two parts to look at, core and app. Unsurprisingly the app
parts give us details about the application environment that created the
document. See if you can guess which is which:
<Application>LibreOffice/6.3.5.2$Windows_X86_64 LibreOffice_project/dd0751754f11728f69b42ee2af66670068624673</Application>
versus
<Application>Microsoft Office Word</Application><AppVersion>16.0000</AppVersion>
The core part contains metadata about the document, some elements conforming
to the Dublin Core Metadata Initiative, which explains the dc and dcterms
namespaces for those items. Other elements conform to the OPC standard and
use the cp (for ‘Core Properties’) namespace.
In particular, the core part records the author(s) of the document as
dc:creator and cp:lastModifiedBy:
<dc:creator>David Murray</dc:creator><cp:lastModifiedBy>David Murray</cp:lastModifiedBy>
<dc:creator></dc:creator><cp:lastModifiedBy></cp:lastModifiedBy>
LibreOffice doesn’t record the information in the empty docx file (although
it does record the language as Swiss French).
If we don’t want Word leaking this information we can use the Info blade of the backstage tab … I don’t really understand the UI lingo here. You select the File entry on the top bar of Word, then the Info entry in the left-hand side bar, then the Inspect Document entry on the Inspect Document tile dropdown, and you will discover that, indeed, the Author is recorded.
This information can be removed. Pressing ‘Remove All’ and then saving the
file does give us empty dc:creator and cp:lastModifiedBy entries. (We may
explore a way of discovering - and removing - this information in bulk in
another post.)
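As a taste of what the discovery half might look like, here is a rough Python sketch that pulls the two author fields out of a docx. It assumes the core properties live at the conventional docProps/core.xml path (strictly, the part should be located through the package relationships), read_authors is a name invented for this example, and the sample package is built in memory purely for illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DC = "http://purl.org/dc/elements/1.1/"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def read_authors(docx):
    """Return (creator, lastModifiedBy), or None if there is no core part."""
    with zipfile.ZipFile(docx) as package:
        # Assumption: the core properties part sits at its conventional path.
        if "docProps/core.xml" not in package.namelist():
            return None
        root = ET.fromstring(package.read("docProps/core.xml"))
    creator = root.findtext(f"{{{DC}}}creator")
    modifier = root.findtext(f"{{{CP}}}lastModifiedBy")
    return creator, modifier


# In-memory stand-in for a Word file, using the values shown above.
core_xml = (
    f'<cp:coreProperties xmlns:cp="{CP}" xmlns:dc="{DC}">'
    '<dc:creator>David Murray</dc:creator>'
    '<cp:lastModifiedBy>David Murray</cp:lastModifiedBy>'
    '</cp:coreProperties>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("docProps/core.xml", core_xml)
sample.seek(0)

print(read_authors(sample))
```

Looping this over a directory of docx files would be one way to find which ones carry an author name.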
Given what we know about OPC, OOXML and what works in Word, can we create
a minimal empty docx file?
Here’s one approach. We need three files:
[Content_Types].xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default ContentType="application/xml" Extension="xml"/>
  <Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
  <Override ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" PartName="/document.xml"/>
</Types>
Here we describe the content contained by the other two files: package
relationships (for rels files - or file, in our case) and a Main
WordprocessingML (WML) Story.
_rels/.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/document.xml"/>
</Relationships>
Here we describe the relationship between the package (because the
relationship is contained in the package-level rels file) and the one
remaining part: ‘OfficeDocument’.
document.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p />
  </w:body>
</w:document>
And finally we describe our empty Main Story - a single empty paragraph.
You will note that we have put our Main Story part at ‘/document.xml’ where
the word processors all put it at ‘/word/document.xml’. This shouldn’t
matter, since the whole point of the rels parts indirection is to provide
this flexibility, although it is possible that some programs that purport
to deal with docx files will choke. As we will see, neither Word nor
Writer has any problems.
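The indirection can be followed mechanically: read the package-level rels file, find the relationship whose Type is ‘officeDocument’, and take its Target. The Python sketch below does just that for the rels content above; main_document_target is a hypothetical helper, and stripping the leading slash is a simplification that only suits package-level targets like ours.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)

# The _rels/.rels content from above, as bytes.
RELS = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    b'<Relationship Id="rId1" '
    b'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    b'Target="/document.xml"/>'
    b'</Relationships>'
)


def main_document_target(rels_xml):
    """Return the archive name of the Main Story part, or None if absent."""
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOCUMENT:
            # Simplification: treat the target as relative to the package root.
            return rel.get("Target").lstrip("/")
    return None


print(main_document_target(RELS))
```

A consumer that looks the part up this way, rather than hard-coding word/document.xml, has no reason to care where we put it.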
One way to get these parts into an appropriate package is to use PowerShell:
# Paste the XML for each of the three parts between the quotes.
$ct = '<the [Content_Types].xml content goes here>'
$rels = '<the _rels/.rels content goes here>'
$doc = '<the document.xml content goes here>'
# Work in a scratch directory so cleanup is easy.
cd $env:temp
New-Item -Path "Minimal" -Type "Directory"
cd Minimal
New-Item -Path "_rels" -Type "Directory"
# Write the three parts to disk.
New-Item -Path '[Content_Types].xml' -Type "File" -Value $ct
New-Item -Path '_rels/.rels' -Type "File" -Value $rels
New-Item -Path 'document.xml' -Type "File" -Value $doc
# Zip the parts into a package and open it with the default application.
Compress-Archive -Path *.xml,_rels/ -DestinationPath Minimal.docx
& .\Minimal.docx
The above code moves us into a temporary directory where we create a
new ‘Minimal’ directory (to make cleanup easier), in which we create
new files with the contents of our three parts. We then Compress these
into a new Zip archive called Minimal.docx, which we open with the
default application (the & .\Minimal.docx line). Assuming this is
Microsoft Word, we find our wonderful, empty and working docx file.
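For comparison, the same package can be assembled without touching the file system. This is a sketch in Python rather than the PowerShell used above; build_minimal_docx is a name chosen for the example, the three constants repeat the XML given earlier, and I have only reasoned about the result, not opened it in every word processor.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default ContentType="application/xml" Extension="xml"/>'
    '<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>'
    '<Override ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" PartName="/document.xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/document.xml"/>'
    '</Relationships>'
)
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p /></w:body>'
    '</w:document>'
)


def build_minimal_docx():
    """Return the bytes of a three-part docx package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("document.xml", DOCUMENT)
    return buffer.getvalue()


data = build_minimal_docx()

# Sanity check: reopen the archive and confirm every part is well-formed XML.
with zipfile.ZipFile(io.BytesIO(data)) as package:
    names = package.namelist()
    for name in names:
        ET.fromstring(package.read(name))
print(names)
```

Writing data to a file named Minimal.docx (for example with open("Minimal.docx", "wb")) gives something to double-click.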
Phew!
Microsoft Word docx files turn out to be complex beasts, but not so
complex that we can’t build them ourselves. Obviously typing in the xml
on the command line isn’t a solution that’s going to scale, but it shows
that, with care, we can manipulate files and still hope to edit them in our
favourite word processor (or send them to others to do so).
Otherwise, we’ve dipped our toes into what makes up a docx file. We’ve
seen that style information is (or can be) kept separate from the Main
Story, and we’ve seen where (lots of) other parts can be added. We’ll look
at some of this in other posts.
Finally, we’ve seen (or if you haven’t played along, you can take my
word for it) that OOXML Tools can be a wonderful way of seeing what’s
inside those docx files - empty or not.
Comments, questions or suggestions? Email me.
David Murray is an old^W experienced in-house lawyer (and amateur smug lisp weenie) who likes to explore personal-scale legaltech. You could follow him on Twitter