Friday, March 9, 2012

Using XQuery in MS SQL for Novices Part 1

A recent project at work presented me with a new challenge. Generally in a database, at least the ones that I've  had to work with, each different type of data is stored in its own column. Well, the project currently under test downloads information from a third party system in the form of an XML file. Even though we don't use every bit of data in provided by the third party, we still want to preserve it in its original format. Therefore we store the entire XML string in the database.

When XML data is stored in a database, there are at least a couple different data types that can be used.
  1. It could be stored as just regular old text using the varchar data type. In that case whatever program is using the data has to convert it back into XML in order to use it. The Umbraco content management system likes to do it this way.
  2. On the other hand, it could be stored natively using the XML data type, which is what our system does. Storing the data AS XML presents some interesting challenges as well as many opportunities when it comes to digging for information within the XML string itself -- most of which I'm still figuring out. For today, I'd like to start with just the basics.
Doing It Old School
If all you're trying to do is find the rows where a specific series of characters occurs, you can search just like you always do... almost. 
SELECT * FROM Table WHERE XMLColumn LIKE '%key-text%'
This traditional query will throw an error faster than a lazy developer's code. Instead, you need to change data types on the fly:
SELECT * FROM Table
WHERE CAST(XMLColumn AS varchar(max)) LIKE '%key-text%'
So to reiterate, this query uses the CAST command to convert the XML column into a text data type.

Being All Fancy About It
Like I said, the old school works fine if you're just trying to identify entire records in the table that match your criteria. If you're just looking for specific nodes in the XML data, the approach above is NOT very efficient. You can harness the power of the XML data type to do things much more quickly! Since v.2005, MS SQL Server has supported XQuery. I'm not a programmer so I don't yet understand the full power that this yields. I've found, however, that the existing help on the internet was not written for people like me so here's what I've learned so far:

STEP 1: Using query()
Assuming that we just want to find the data stored in a specific XML node, we can use the Query() function.
SELECT XMLColumn.query('//XML-Node-Name') FROM Table
A couple of notes about what's going on there:
  1. Query() is a function. Inside of the function is one ore more arguments (RAWR! J/K), which are written inside of the parentheses. Each argument must be enclosed in single quotes and multiple arguments are comma separated. The query() function will accept accept ONE argument.
  2. The double slashes // at the start of the argument tell the system that you're looking for a node and for our purposes, it doesn't matter whether it has child or parent nodes.
  3. You need to put the goods of the function inside of the single quotes.
  4. The node name is CASE SENSITIVE
  5. Don't capitalize the the function: e.g. type "query" not "Query"
When this query runs, unless you add a WHERE clause, it's going return a result for every single row in the table, including the rows where the specified node does not exist. If the node does not exist for that row, there will be an empty cell in the results table. If the node does exist, the data returned will look like this:
<xml-node-name>XML data stored in node</xml-node-name>
STEP 2: Using exists()
Because using query() alone can be a little messy returning empty rows, we can use another function that checks YES or NO does something exist in the XML data.
SELECT XMLColumn.query('//XML-Node-Name'FROM Table WHERE XMLColumn.exists('//XML-Node-Name') = 1
 This query will only return the rows where the specified node exists without any empty rows. A couple of notes about what's going on here:

  1. I used the same argument for both functions because if the node doesn't exist that I'm looking for, that row doesn't need to be returned.
  2. The = 1 part is telling it to evaluate to TRUE (e.g. Yes, it exists) whereas, if I set it to = 0 it would be the opposite (e.g. No, it doesn't exist). Setting it to 0 would only return the blank rows in this example.
Please let me know in the comments if you've found this useful. Look for more posts to come as I learn to harness more power of the XQuery!

Helpful but technical links (also probably the first ones you came across when googling "XQuery SQL Server":

2 comments: