XML fundamentals

Definition

XML

= extensible markup language

W3C standard for document markup

  • structural and semantic language / document type

    describes static documents that carry data but do not do anything themselves:

    not a programming language or network protocol

    not a database but can be stored in databases

  • plain-text

    portable data, human-readable, machine-readable

  • application-specific, extensible

    no fixed set of tags

    can be extended to different needs

  • parsing

    a parser extracts the content from the document

    the document must be well-formed (syntactically correct), otherwise the parser rejects it
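A minimal sketch of the well-formedness check, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the two sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<course><title>XML</title></course>"
bad = "<course><title>XML</course>"  # <title> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A parser stops at the first syntax error, so a document that is not well-formed yields no content at all.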

HTML

XML ≠ HTML

  • presentation language
  • fixed set of tags with predefined meanings
  • not extensible

    only used for web pages

Fundamentals

Elements and Tags

element = tags + content

  • content can be empty, consist of text, elements or be mixed
  • tags are case-sensitive
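The content kinds listed above can be sketched as follows, assuming Python's xml.etree.ElementTree; the element names (note, empty, mixed, ...) are invented for the example:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<note>"
    "<empty/>"                              # empty content
    "<text>plain text</text>"               # text content
    "<nested><child/></nested>"             # element content
    "<mixed>some <b>bold</b> text</mixed>"  # mixed content
    "</note>"
)

mixed = doc.find("mixed")
print(repr(mixed.text))     # text before the first child element
print(mixed[0].tag)         # the child element itself
print(repr(mixed[0].tail))  # text following the child element

# Tags are case-sensitive: <Title> cannot be closed by </title>.
case_error = None
try:
    ET.fromstring("<Title>XML</title>")
except ET.ParseError as err:
    case_error = err
print("mismatched case rejected:", case_error is not None)
```

ElementTree stores mixed content as the text of the parent plus the tail of each child, which is a design choice of that library rather than something XML itself prescribes.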

Attributes

<tagName attributeName="attributeValue">

Tags can have multiple attributes

  • attribute order is not significant
  • attribute names must be unique within an element
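Both attribute rules can be checked with a short sketch, assuming Python's xml.etree.ElementTree; the course element and its attributes are illustrative only:

```python
import xml.etree.ElementTree as ET

# The same attributes written in a different order.
a = ET.fromstring('<course id="42" lang="en"/>')
b = ET.fromstring('<course lang="en" id="42"/>')
same_attributes = a.attrib == b.attrib  # dict comparison ignores order
print(same_attributes)

# Repeating an attribute name on one element is a well-formedness error.
try:
    ET.fromstring('<course id="42" id="43"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```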

allowed names

XML names = element names, attribute names, construct names

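  • may contain letters, digits, underscores, hyphens and periods (non-English letters are allowed)
  • must start with a letter or underscore

    the colon (:) is also allowed but reserved for namespaces

  • must not start with xml (in any casing), since such names are reserved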
  • no size limit
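The naming rules can be approximated with a regular expression. This is a rough sketch, not the exact Name production of the XML specification: is_plausible_xml_name is a hypothetical helper, Python's \w stands in for the specification's precise Unicode ranges, and the colon is deliberately left out:

```python
import re

# Approximation only: the specification lists exact Unicode ranges and also
# permits the colon, which is excluded here because it is reserved.
_NAME = re.compile(r"[^\W\d][\w.\-]*")


def is_plausible_xml_name(name: str) -> bool:
    """Hypothetical helper applying the naming rules from the notes."""
    if name.lower().startswith("xml"):
        return False  # reserved prefix, in any casing
    return _NAME.fullmatch(name) is not None


for candidate in ["course", "_id", "first-name", "größe", "2nd", "xmlData", "a b"]:
    print(candidate, is_plausible_xml_name(candidate))
```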

entity references

content must not contain < or & literally, but can use predefined entity references instead (numeric character references such as &#60; also work)

(new entities can be declared in a DTD)

Mandatory:

&lt; for <

&amp; for &

Optional:

&gt; for >

&quot; for "

&apos; for '
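Escaping is usually left to a library. A small sketch, assuming Python's xml.sax.saxutils and xml.etree.ElementTree; the sample string and the code element are made up:

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

raw = 'if a < b && c > "d"'

# escape() replaces &, < and > by default.
escaped = escape(raw)
print(escaped)

# Quotes only need escaping inside attribute values; pass them explicitly.
escaped_quotes = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped_quotes)

# The parser turns the references back into the original characters.
parsed_text = ET.fromstring(f"<code>{escaped}</code>").text
print(parsed_text == raw)

# Without escaping, the same text is not well-formed.
try:
    ET.fromstring(f"<code>{raw}</code>")
    raw_rejected = False
except ET.ParseError:
    raw_rejected = True
print(raw_rejected)
```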

Comments

<!-- comment -->

not element

comment must not contain --
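Both points about comments can be observed with Python's xml.etree.ElementTree (an assumption of this sketch; other parsers may expose comments differently). With its default settings the parser leaves comments out of the element tree:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<course><!-- draft --><title>XML</title></course>")
child_tags = [child.tag for child in root]  # the comment is not a child element
print(child_tags)

# A double hyphen inside a comment makes the document not well-formed.
try:
    ET.fromstring("<course><!-- a -- b --></course>")
    double_hyphen_rejected = False
except ET.ParseError:
    double_hyphen_rejected = True
print(double_hyphen_rejected)
```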

Processing Instructions

<?target instruction?>

e.g. <?xml-stylesheet href="course.css" type="text/css"?>

not element

used to pass information to applications

the target is the XML name of the application or an identifier
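An application can read the target and the instruction text separately. A minimal sketch, assuming Python's xml.dom.minidom and a made-up course root element:

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    '<?xml-stylesheet href="course.css" type="text/css"?><course/>'
)

pi = doc.firstChild  # the processing instruction precedes the root element
print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # the target name
print(pi.data)    # everything after the target, as one uninterpreted string
```

The part after the target only looks like attributes. The parser hands it over as plain text, and interpreting it is up to the receiving application.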

XML Declaration

first thing in the document

not element, not processing instruction

begins with <?xml; the declaration is optional in XML 1.0 but recommended

e.g. <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

version: the XML version used in the document

encoding: the character encoding used - default is UTF-8

standalone: yes if the document does not depend on external declarations - default is no
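The three declaration fields can be read back after parsing. A sketch assuming Python's xml.dom.minidom, which exposes them as attributes of the document object, and xml.etree.ElementTree for the decoded text; the city element and its content are invented:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Bytes, not str: the encoding declaration describes how the bytes are encoded.
data = (
    b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
    b"<city>Z\xfcrich</city>"  # 0xFC is u-umlaut in ISO-8859-1
)

doc = minidom.parseString(data)
print(doc.version, doc.encoding, doc.standalone)

# The parser uses the declared encoding to decode the content.
city = ET.fromstring(data)
print(city.text)
```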