Wikipedia:Wiki Markup Language

{{for|help with code used to build Wikipedia articles|Help:Wiki markup}}

Wiki Markup Language (WKML) is a group, started by Nick Irelan, that seeks to create an XML-based format for wikis so that the information they present can be used by third-party programs such as mashups.

This format should be designed so that wikis can be syndicated in much the same way as an RSS feed.

Request to define format and useful attributes

All editors with XML experience are welcome to comment on how they feel Wikipedia-created content (such as, but not limited to, infoboxes and titles) should be formatted.

= What is the broadly defined structure of any given article? =

All of the common parts of articles should receive special attention to make sure they can be worked with easily by a programmer. For example, the history section would probably be used a lot.

Title
Sections
Categories
Info boxes (?)

= What are the broadly defined attributes of any given article? =

Which attributes of the articles, if any, should be included in WKML?

Date created
Updates

Request to define discussion pages format

All editors with XML experience are welcome to comment on how they feel discussion pages should be formatted.

Basic principles: narrowly or broadly defined?

All XML formats represent a balance struck between two extremes: the format being defined very narrowly, and the format being defined very broadly. Each approach carries with it a series of advantages and challenges, and it is the job of a format designer to decide which approach's characteristics best suit the way that the format will be used.

Generally speaking, broadly defined formats are more usable, and in particular are more likely to be adopted. One of the reasons for the incredible success of RSS is that it is, eponymously, really simple, and easily expanded. The drawback of a broadly defined format is that it generally requires more work on the part of the developer when the consumer desires a specific application for a document. Consider the following two formats:

Narrow definition



George Washington
George Washington was a...

Broad definition


George Washington was a...

Obviously, the second article can be about anything. This is an advantage in that it fits the extremely loosely structured nature of Wikipedia, but a disadvantage in that it makes development for specific formats more difficult. Defining narrow formats would require articles to conform to templates at the very least, which would drastically limit the scope of articles that could be exposed via WKML. Ultimately, a format which can be applied to any Wikipedia article, no matter how minimally or bizarrely defined, is probably the most appropriate approach. If infobox-specific or template-specific information is available, it should be used, but it cannot be required.
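The contrast between the two definitions can be sketched in a few lines of Python. This is only an illustration: the element names used here (person, name, biography, article, content) are assumptions made for the example and are not part of any agreed WKML format.

```python
import xml.etree.ElementTree as ET


def build_narrow(name: str, summary: str) -> ET.Element:
    """Narrow definition: the element names say what kind of thing is described.

    The "person" vocabulary is hypothetical and would not fit other articles.
    """
    root = ET.Element("person")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "biography").text = summary
    return root


def build_broad(body: str) -> ET.Element:
    """Broad definition: a generic container that can hold any article."""
    root = ET.Element("article")
    ET.SubElement(root, "content").text = body
    return root


narrow = build_narrow("George Washington", "George Washington was a...")
broad = build_broad("George Washington was a...")

# Serialize both documents so the difference in structure is visible.
print(ET.tostring(narrow, encoding="unicode"))
print(ET.tostring(broad, encoding="unicode"))
```

A consumer of the narrow document can ask for the name directly, while a consumer of the broad document has to work out for itself what the content is about.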

How the format will be used

XML exists to be consumed or transformed. An XML document might be consumed by a database in the form of an updategram, which leads to an update of a record in a table, or it might be transformed into a more readable format, as in the case of an RSS reader.

A pure consumption of the XML involves direct coding against the format, which is typically very narrowly defined. It seems unlikely that much Wikipedia content would be used this way, so this leaves transformation as the end use of WKML.

One approach would be simply to use RSS. The problem with this approach is that RSS is probably too broad and does not suit the encyclopedic nature of Wikipedia. If RSS is required for a mashup (an RSS feed of new pages, or changes to a page, for example), WKML can be transformed into RSS as long as it meets some minimal requirements.
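As a rough sketch of such a transformation, the Python below turns a small WKML-like document into an RSS 2.0 item. The input vocabulary (article, title, section) and the function name wkml_to_rss_item are assumptions made for this example; only the item, title, link and description elements on the output side come from RSS 2.0. The "minimal requirements" assumed here are a title and at least one section of text.

```python
import xml.etree.ElementTree as ET

# Hypothetical WKML input; the element names are illustrative only.
WKML_SAMPLE = """<article>
  <title>George Washington</title>
  <section heading="Summary">George Washington was a...</section>
</article>"""


def wkml_to_rss_item(wkml_text: str, link: str) -> ET.Element:
    """Map the title and first section of a WKML-like article to an RSS item."""
    article = ET.fromstring(wkml_text)
    item = ET.Element("item")
    ET.SubElement(item, "title").text = article.findtext("title", default="")
    ET.SubElement(item, "link").text = link
    # Use the first section as the description, or an empty string if absent.
    first_section = article.find("section")
    has_text = first_section is not None and first_section.text
    ET.SubElement(item, "description").text = first_section.text if has_text else ""
    return item


item = wkml_to_rss_item(WKML_SAMPLE, "https://en.wikipedia.org/wiki/George_Washington")
print(ET.tostring(item, encoding="unicode"))
```

In practice the same mapping could be written as an XSLT stylesheet; Python is used here only to keep the example self-contained.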

A hybrid approach

Indeed, one way to solve this problem of specificity versus simplicity would be to define separate formats divided along, for example, template lines, with a rolldown to a generic article format. This approach has the further advantage that it need not be implemented immediately; a basic WKML format could be created and deployed for widespread use, and then later function as the rolldown format when more specific formats are created. Whether the broad or narrow format of the article is used can then be an attribute of the request, with the rolldown format as the default for backwards compatibility.
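The selection logic described above might look something like the following sketch. The registry of template-specific formats, the format names and the "narrow"/"broad" request values are all hypothetical; the point is only that the generic format is both the default and the fallback.

```python
from typing import Iterable

# Name of the generic (rolldown) format; an assumption for this sketch.
GENERIC_FORMAT = "article"

# Hypothetical registry mapping templates to more specific formats.
SPECIFIC_FORMATS = {
    "Infobox person": "person",
    "Infobox settlement": "settlement",
}


def choose_format(templates: Iterable[str], requested: str = "broad") -> str:
    """Pick the format for a response.

    A narrow format is used only when the request asks for one and the
    article uses a template with a registered format. Every other case
    rolls down to the generic format.
    """
    if requested == "narrow":
        for template in templates:
            if template in SPECIFIC_FORMATS:
                return SPECIFIC_FORMATS[template]
    return GENERIC_FORMAT


print(choose_format(["Infobox person"], requested="narrow"))
print(choose_format(["Infobox person"]))
print(choose_format(["Some other template"], requested="narrow"))
```

Because the default request value is the broad format, clients written before any specific formats exist would keep receiving the same documents.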

Open questions

Are article content and metadata documents separate or are they unified into a single document?
* Content and history are definitely contained in the same document. Documents which represent previous revisions of the document are separate documents, retrieved by a separate syndication / service command.
What is the broadly defined structure of any given article? (Title, sections, ?)
* The broadly defined structure is as follows:
*# root, containing a root document element and some document meta-info such as category information.
*# title
*# content sections
*# history
What are the broadly defined attributes of any given article? (Title, date created, further date questions feed back into question #1)
How will infoboxes be handled?
* At this point, it's pretty clear that we have to renotate all of the infobox content.
How much and how will HTML be scrubbed from the content?
* A general principle which might be helpful is that markup that affects presentation (such as bold or italic text) will be scrubbed without renotation, while markup that affects actions, such as links, will be renotated in a format that we have to determine. One exception to this will be images, which may need to be renotated, or may simply pass through as is.
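The scrubbing principle in the last answer can be illustrated with a small Python sketch. The link element and its target attribute are placeholders, since the renotation format is still to be determined, and the regular expressions are a simplification that only handles simple, well-formed tags; a real implementation would more likely use an HTML parser.

```python
import re

# Presentation-only tags that are dropped while their text is kept.
PRESENTATION_TAGS = re.compile(r"</?(?:b|i|strong|em)\s*>", re.IGNORECASE)

# Simple anchors of the form <a href="...">text</a>.
ANCHOR = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def scrub(html: str) -> str:
    """Drop presentation markup and renotate links.

    <link target="..."> is a hypothetical renotation, not an agreed format.
    Other tags, including images, pass through unchanged.
    """
    text = PRESENTATION_TAGS.sub("", html)
    return ANCHOR.sub(r'<link target="\1">\2</link>', text)


sample = '<b>George Washington</b> was the first <a href="/wiki/President">president</a>.'
print(scrub(sample))
```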

Proposed date format

Wiki Markup Language (WKML) documents use the date format specified in the ISO standard document [http://www.w3.org/TR/NOTE-datetime ISO 8601:1988(E)]. An important distinction:

Document node values or document attributes which are dates are subject to this standard; dates used in the content are not.

For example:

Standard applies

Standard does not apply

Brett Bretterson was born at 8:14 PM on Saturday, June 21st, 1954.

Differently grained attributes may require different formats specified within the standard, e.g., some dates may not require time. As per the standard, all times are UTC.
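A minimal sketch of producing such values in Python follows. The function name format_wkml_date and the with_time switch are assumptions made for the example; the output strings follow the complete date and complete date plus time forms of the linked note, with times converted to UTC and marked with Z.

```python
from datetime import datetime, timedelta, timezone


def format_wkml_date(moment: datetime, with_time: bool = True) -> str:
    """Format a timezone-aware datetime as a UTC date or date-time string."""
    if moment.tzinfo is None:
        # A naive datetime has no defined offset, so it cannot be converted safely.
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    if with_time:
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return utc.strftime("%Y-%m-%d")


# 8:14 PM at a UTC-5 offset falls on the following day in UTC.
updated = datetime(2008, 3, 1, 20, 14, tzinfo=timezone(timedelta(hours=-5)))
print(format_wkml_date(updated))
print(format_wkml_date(updated, with_time=False))
```

An attribute such as the last update time would use the full form, while a coarser attribute could use the date-only form.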

Examples

= Highly developed article in XML format =

(In Progress)