VIM: A journey across XML and regexps

(March 2013)

For the TL;DR crowd: I worked with XML recently, so I enhanced my VIM to (a) automatically invoke SAXCount with ':make' and validate the currently opened .xml file, with automatic navigation to error lines (just as VIM does for C/C++), and (b) automatically align the element attributes of any visually selected block.

Editing XML files in vim

Over the last couple of months, I've been building a set of code generators. They work from an XML file - and after reading it, they generate... stuff.

Lots of stuff.

The reason I went with .xml/.xsd files this time - and didn't design my own domain-specific language - is a simple one: in this case, the resulting "language" and tools will be used by non-programmers. These people must therefore be able to work in something resembling an IDE - with auto-completion a mandatory requirement.

In combination with editors like Eclipse / Visual Studio, .xsd files cover this need quite well. As the analysts create the .xmls that are fed into my code generators, these monster IDEs guide them - showing what they are allowed to enter at each point in the .xml file, highlighting errors, etc.

If you write your own DSL, getting up to this point is a lot more difficult (you basically have to write your own IDE).

So all went well. I created my code generators, people started creating .xmls, and marvelous, working things came out of them.

Mostly.

Validation

You see, you can never trust your input. Ever.

I therefore had to bulk-validate the .xml files - and found the best, strictest checks to be performed by SAXCount, a part of the Xerces XML parser:

$ SAXCount -n -s -f *xml
Error at file /var/tmp/a.xml, line 4, char 23
  Message: empty content is not valid for content model '(transferBatch|notification)'
Error at file /var/tmp/b.xml, line 8, char 33
...

I tried other validators, too - and SAXCount seemed to be the most robust one. It caught things that others didn't, so long as the file began with a reference to the .xsd:

<?xml version="1.0" encoding="utf-8" ?>
<Genesis
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="Genesis.xsd">
    <Item ...>
    ...

Being a VIM guy, I wondered...

If only there were a way to easily navigate through the errors of each .xml file, jumping immediately with the F4 function key from each error to the next... with the error info displayed at the bottom line of my editor.

Just as VIM does for C and C++, that is. And for Python (with Syntastic installed).

Alignment

Moreover, while debugging, I had to quickly identify parts of the .xml files. I found the... misaligned aspect of element attributes to be anything but helpful:

<Item param="STR_NAME_GTE" label="Name from:" pw="2:10" />
<Item param="D_APPLOGGED_DATE" label="Date you logged in:" pw="62:10" />
<Item param="I_MINID" label="Serial:" pw="2:10" />
<Item param="I_MAX_SID" label="Up to serial ID:" pw="62:10" ... />
<Item param="BD_MINPRICE" label="Price:" pw="2:30" />

Imagine debugging hundreds of such lines - rearranging the attributes would help immensely in visually locating what is where:

<Item param="STR_NAME_GTE"     label="Name from:"          pw="2:10" />
<Item param="D_APPLOGGED_DATE" label="Date you logged in:" pw="62:10" />
<Item param="I_MINID"          label="Serial:"             pw="2:10" />
<Item param="I_MAX_SID"        label="Up to serial ID:"    pw="62:10" ... />
<Item param="BD_MINPRICE"      label="Price:"              pw="2:30" />

So how does one go about implementing this functionality in VIM?

Adding SAXCount validation

Spawning an external tool from within VIM is easy. However, I wanted much more than just that; I wanted the same functionality I have for :make (which I've mapped to the function key F7) - that is, errors shown in the error list window, and me navigating from one to the next with F4 (which I've mapped to :cnext).

So I created a saxcount folder under my .vim/bundle, and wrote the following two settings in my saxcount/ftplugin/xml.vim (the first one is wrapped with a leading backslash, which is how Vim script continues a line):

se errorformat=%E,%C%.%#Error\ at\ file\ %f%.\ line\ %l%.\ char\ %c,
    \%C\ \ Message:\ %m,%Z,%-G%f:\ %*[0-9]\ ms\ %.%#
se makeprg=SAXCount\ -n\ -s\ -f\ %

How did I get there?

Well, the second setting is easy: se makeprg=SAXCount\ -n\ -s\ -f\ % makes my F7 (mapped to :make) invoke SAXCount on the current file (that is what the % expands to) instead of make.

The magic errorformat line is another story :‑)

It is supposed to catch error messages like these:

$ SAXCount -n -s -f a.xml
Error at file /var/tmp/a.xml, line 4, char 23
  Message: empty content is not valid for content model '(transferBatch|notification)'

... or fatal errors, which similarly begin with "Fatal Error" instead of "Error":

Fatal Error at file ...

Breaking down the two rules of my errorformat, this is the first one ...

se errorformat=
    // An error report spans multiple lines: it begins with %E and ends with %Z
    %E,%C%.%#Error\ at\ file\ %f%.\ line\ %l%.\ char\ %c,%C\ \ Message:\ %m,%Z,

... which works as follows:

%E  // begin multiline match of an error report
,   // end of first line from SAXCount, which is always empty
%C  // continuation - next line
%.%#Error...
    // which matches '.*Error...' - so it also catches "Fatal Error..."
%f%.
    // filename, followed by any char - in this case, the comma,
    // I could not use '\,' so I just used a '%.'
%l and %c 
    // similarly, line and column number
%C
    // continuation - next line
Message: %m
    // matches the actual message for the copen list
%Z
    // end multiline match

The second errorformat rule ignores (hence the minus in %-G) the informational lines emitted by SAXCount:

a.xml: 11 ms (64 elems, 207 attrs, 1133 spaces, 0 chars)

...via this:


%-G%f:\ %*[0-9]\ ms\ %.%#
// basically: filename, colon, space, numbers, space, "ms", and ".*"
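
For readers who find errorformat hard to read, here is a minimal Python sketch of the same two rules. It is only an illustration of the matching logic, not what Vim does internally; the names parse_saxcount, HEADER, MESSAGE and TIMING are made up for this example, and the sample text is the SAXCount output quoted above.

```python
import re

# Mirrors the first errorformat rule: a header line with file, line and
# column, followed by an indented "Message:" line.
HEADER = re.compile(
    r".*Error at file (?P<file>.+), line (?P<line>\d+), char (?P<col>\d+)")
MESSAGE = re.compile(r"\s*Message: (?P<msg>.*)")
# Mirrors the second rule (%-G...): informational timing lines to ignore.
TIMING = re.compile(r".+: \d+ ms .*")


def parse_saxcount(output):
    """Return (file, line, column, message) tuples found in SAXCount output."""
    errors = []
    pending = None  # header seen, still waiting for its Message line
    for text in output.splitlines():
        if TIMING.fullmatch(text):
            pending = None
            continue
        header = HEADER.fullmatch(text)
        if header:
            pending = header
            continue
        message = MESSAGE.fullmatch(text)
        if message and pending:
            errors.append((pending["file"], int(pending["line"]),
                           int(pending["col"]), message["msg"]))
            pending = None
    return errors


sample = """
Error at file /var/tmp/a.xml, line 4, char 23
  Message: empty content is not valid for content model '(transferBatch|notification)'
a.xml: 11 ms (64 elems, 207 attrs, 1133 spaces, 0 chars)
"""
for error in parse_saxcount(sample):
    print(error)
```

The leading ".*" in HEADER plays the role of %.%# in the errorformat, so "Fatal Error at file ..." lines are picked up by the same rule.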

And now, all I have to do to validate .xml files is :make (or just hit F7), and navigate from each error to the next with F4 (:cnext) - just as I do for my Python and C++ work.

One down, one to go.

Aligning element attributes

The end result: after visually selecting an area, I use the Leader key ( \ ) followed by '=', and attributes will line up - because of this mapping I added in my .vimrc (wrapped here with Vim's backslash line continuation):

vmap <buffer> <Leader>=
    \ :Tabularize/\v\zs\w+\ze\=["']<CR>
    \gv:!eatPeskySpacesOfTabularizedXML.pl<CR>

...with eatPeskySpacesOfTabularizedXML.pl containing this:

#!/usr/bin/perl
while(<>) {
    s,(\w+)(\s*) =\s*(["'])((?:(?!\3).)*)\3,$1$2=$3$4$3,g;
    print;
}

There's a lot of interesting backstory in this, though. Keep reading.

The way of the `Tabular`

As is almost always the case, the necessary VIM plugin is just a Google search away. In my case, searching for 'vim alignment' pointed to Tabular.

So assuming you set marks a and b at the beginning and end of the section below...

<Item param="STR_NAME_GTE" label="Name from:" pw="2:10" />
<Item param="D_APPLOGGED_DATE" label="Date you logged:" pw="62:10" />
<Item param="I_MINID" label="Serial:" pw="2:10" />
<Item param="I_MAX_SID" label="Up to serial:" pw="62:10" nl="true" />
<Item param="BD_MINPRICE" label="Price:" pw="2:30" />

...this:

:'a,'bTabularize /=

...gets you this:

<Item param = "STR_NAME_GTE" label     = "Name from:" pw       = "2:10" />
<Item param = "D_APPLOGGED_DATE" label = "Date you logged:" pw = "62:10" />
<Item param = "I_MINID" label          = "Serial:" pw          = "2:10" />
<Item param = "I_MAX_SID" label        = "Up to serial:" pw    = "62:10" nl = "true" />
<Item param = "BD_MINPRICE" label      = "Price:" pw           = "2:30" />

Which is nice, but not what I wanted. After five minutes of skimming the Tabular manual:

:'a,'bTabularize/\v\zs\w+\ze\=["']

...gave me this:

<Item param ="STR_NAME_GTE"     label ="Name from:"       pw ="2:10" />
<Item param ="D_APPLOGGED_DATE" label ="Date you logged:" pw ="62:10" />
<Item param ="I_MINID"          label ="Serial:"          pw ="2:10" />
<Item param ="I_MAX_SID"        label ="Up to serial:"    pw ="62:10"    nl ="true" />
<Item param ="BD_MINPRICE"      label ="Price:"           pw ="2:30" />

...which is almost perfect.

Breaking down the regexp to see how this works:

\v\zs\w+\ze\=["']

\v: enter very magic mode (mostly Perl-ish regular expressions)
\zs: set start of match here
\w+: match a word (the attribute name, e.g. param or label)
\ze: set end of match here
...followed by an equal sign and any kind of quote.

Tabular will then place a single space before and after every match, making sure the matches line up across lines.
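
To make the "split on the match, then pad" idea concrete, here is a rough Python sketch of it. This is a simplified model of what the alignment does, not Tabular's actual implementation: align_attributes and ATTR_NAME are names invented for this example, the look-ahead (?=...) stands in for Vim's \ze, and leading indentation is not preserved.

```python
import re
from itertools import zip_longest

# Python spelling of the Vim pattern \v\zs\w+\ze\=["']:
# an attribute name that is immediately followed by = and a quote.
ATTR_NAME = re.compile(r"""(\w+)(?==["'])""")


def align_attributes(lines):
    """Pad the fields around each attribute name so the names line up."""
    # re.split with one capturing group keeps the delimiters (the names).
    rows = [[field.strip() for field in ATTR_NAME.split(line)]
            for line in lines]
    # Width of every column = longest field found in that column.
    widths = [max(len(field) for field in column)
              for column in zip_longest(*rows, fillvalue="")]
    return [" ".join(field.ljust(width)
                     for field, width in zip(row, widths)).rstrip()
            for row in rows]


items = [
    '<Item param="STR_NAME_GTE" label="Name from:" pw="2:10" />',
    '<Item param="D_APPLOGGED_DATE" label="Date you logged:" pw="62:10" />',
    '<Item param="I_MINID" label="Serial:" pw="2:10" />',
]
for line in align_attributes(items):
    print(line)
```

Every field, including the attribute name itself, becomes a column joined by one space, which is where the space before each equal sign comes from.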

So, are we done?

The space before the equal sign

No, there's that pesky space before the equal sign. I am weird, I know :‑)

How would I go about removing it?

A simple regexp search and replace (s/ ="/="/g) would do the trick - but what if the attribute values themselves contain equal signs? For example:

posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

The naive replacement would then corrupt them. No, we should locate the start of each attribute value more carefully - taking into account that XML attribute values can in fact use single quoting, too.
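
A quick Python illustration of the problem, using the first sample line above (first_value is a throwaway helper written for this example):

```python
import re

line = 'posAndWidth ="40:5 ="   height        ="1"'

# Naive approach: drop the space before every =" pair, wherever it occurs,
# including the one that sits inside the first attribute's value.
naive = line.replace(' ="', '="')


def first_value(text):
    """Return the contents of the first double-quoted string in text."""
    return re.search(r'"([^"]*)"', text).group(1)


print(repr(first_value(line)))
print(repr(first_value(naive)))
```

The two printed values differ: the space inside the quoted string is gone after the naive replacement, so the data itself has been modified, not just the layout.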

Let's hunt them down:

/\w+\s* =\s*(["'])[^\1]*\1

In detail:

\w+: match the attribute name
\s*: followed by optional whitespace
 =: followed by a single space and the equal sign
\s*: followed by optional whitespace
(["']): followed by either kind of quote, which we mark...
[^\1]*: ...so that we can search for any character except it as many times as possible
\1: followed by the quote that we began with in the first place.

Should work, no?

Well... it doesn't.

Why?

I couldn't figure it out. So I asked the all-knowing Oracle for help.

A kind soul there explained that the negation I was using ([^\1]) doesn't work. Apparently, back references are not recognized inside character classes - within the brackets, \1 simply does not mean "the first captured group".
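
Here is a small way to see the failure, sketched with Python's re module instead of Perl (so treat the details as Python's behaviour): inside a character class, Python reads \1 as the character with octal code 1, which means [^\1] happily matches quotes as well. The sample text is made up for the example.

```python
import re

text = '''a ="x" b ='y' c ="z"'''

# Inside [...], \1 is not a back reference; Python reads it as the
# character with octal code 1, so [^\1] matches virtually anything.
broken = re.compile(r"""(["'])[^\1]*\1""")

print(broken.search(text).group(0))
```

Instead of stopping at the end of "x", the match runs greedily to the last double quote on the line, swallowing the other two attributes on the way.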

But you can use ... look-ahead, to make sure the character that follows is NOT the quote captured by the back reference.

So what I want can be expressed like this, in regular expression engines that support look-ahead (like Perl's):

/\w+\s* =\s*(["'])((?!\1).)*\1

The new parts:

(?!\1): look ahead, and make sure that what follows is not the back reference (the quote we've seen before)
.: Now that we know it isn't, match any character
*: Do this as many times as possible
\1: followed by the quote that we began with in the first place.

In fact, we don't need to remember each character matched by the repeated group ((?!\1).) - capturing it on every iteration is wasted work - so we can turn it into a non-capturing group with the (?: ... ) syntax.

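And this is how my journey ended. The attribute name and the whitespace after it are now captured as well ($1 and $2), which makes the quote the third group (hence \3) and the string's contents the fourth ($4):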

s,(\w+)(\s*) =\s*(["'])((?:(?!\3).)*)\3,$1$2=$3$4$3,g;

I placed a Perl script doing this in my utilities and invoke it right after Tabularize.
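
If you'd rather not depend on Perl, the same substitution can be sketched in Python. The pattern is unchanged; only the replacement syntax differs (\1..\4 instead of $1..$4). The function name eat_pesky_spaces is made up for this example, and the sample lines are the ones with equal signs inside the strings from earlier.

```python
import re

# Same pattern as the Perl one-liner: name, padding, " =", quote,
# everything up to the matching quote, and the closing quote.
ATTRIBUTE = re.compile(r"""(\w+)(\s*) =\s*(["'])((?:(?!\3).)*)\3""")


def eat_pesky_spaces(line):
    """Remove the single space Tabular leaves right before each '='."""
    return ATTRIBUTE.sub(r"\1\2=\3\4\3", line)


for line in [
    'posAndWidth ="40:5 ="   height        ="1"',
    '''posAndWidth ="-1:8 ='"  textAlignment ="Right"''',
]:
    print(eat_pesky_spaces(line))
```

Because each match consumes the whole quoted value, the ' =' sequences inside the strings are never visited on their own and come out untouched.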

Here's the code

You can fork my VIM configuration on GitHub to automatically use these two tricks, if you think they are useful.

One thing is certain: I learned a lot while making them work.

Updated: Sat Oct 8 11:41:25 2022
