XML
XML is a significant markup language, mainly intended as a means of serialising data structures as a text document. Go has basic support for XML document processing.
Introduction
XML is now a widespread way of representing complex data structures serialised into text format. It is used to describe documents such as DocBook and XHTML, and in specialised markup languages such as MathML and CML (Chemistry Markup Language). It is also used to encode data as SOAP messages for Web Services, and a Web Service itself can be specified using WSDL (Web Services Description Language).
At the simplest level, XML allows you to define your own tags for use in text documents. Tags can be nested and can be interspersed with text. Each tag can also contain attributes with values. For example,
<person>
<name>
<family> Newmarch </family>
<personal> Jan </personal>
</name>
<email type="personal">
jan@newmarch.name
</email>
<email type="work">
j.newmarch@boxhill.edu.au
</email>
</person>
The structure of any XML document can be described in a number of ways:
- A document type definition (DTD) is good for describing structure
- XML Schema is good for describing the data types used by an XML document
- RELAX NG is proposed as an alternative to both
There is argument over the relative value of each way of defining the structure of an XML document. We won't buy into that, as Go does not support any of them. Go cannot check the validity of a document against a schema; it can only check that the document is well-formed.
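As a rough illustration of what a well-formedness check can look like, the sketch below reads tokens until the input is exhausted and reports the first syntax error. The helper name checkWellFormed and the sample documents are assumptions made for this example; only the encoding/xml calls come from the standard library.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// checkWellFormed (a hypothetical helper) tokenizes doc and returns the first
// syntax error, or nil if every token could be read. It does not validate
// against a DTD or schema.
func checkWellFormed(doc string) error {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := decoder.Token()
		if err == io.EOF {
			return nil // reached the end without a syntax error
		}
		if err != nil {
			return err // e.g. a mismatched or unclosed tag
		}
	}
}

func main() {
	good := "<person><name>Jan</name></person>"
	bad := "<person><name>Jan</person>" // <name> is never closed

	fmt.Println("good:", checkWellFormed(good))
	fmt.Println("bad: ", checkWellFormed(bad))
}
```

The check relies on Token returning an error when start and end elements are not properly nested, so no extra bookkeeping is needed.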
Four topics are discussed in this chapter: parsing an XML stream, unmarshalling XML into Go data, marshalling Go data into XML, and XHTML.
Parsing XML
Go has an XML parser which is created using NewDecoder. This takes an io.Reader as parameter and returns a pointer to a Decoder. The main method of this type is Token, which returns the next token in the input stream. The token is one of the types StartElement, EndElement, CharData, Comment, ProcInst or Directive.
The types are
StartElement
- The type StartElement is a structure with two fields:
type StartElement struct {
	Name Name
	Attr []Attr
}
type Name struct {
	Space, Local string
}
type Attr struct {
	Name  Name
	Value string
}
EndElement
- This is also a structure
type EndElement struct {
	Name Name
}
CharData
- This type represents the text content enclosed by a tag and is a simple type
type CharData []byte
Comment
- Similarly for this type
type Comment []byte
ProcInst
- A ProcInst represents an XML processing instruction of the form <?target inst?>
type ProcInst struct {
	Target string
	Inst   []byte
}
Directive
- A Directive represents an XML directive of the form <!text>. The bytes do not include the <! and > markers.
type Directive []byte
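To show how the fields of these types might be used, here is a small sketch that walks the tokens of a document and lists every attribute found on a StartElement. The helper name attrsOf and the "element@attribute=value" output format are assumptions for illustration only.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// attrsOf (a hypothetical helper) returns one "element@attribute=value"
// string for each attribute in doc.
func attrsOf(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var found []string
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			return found, nil
		}
		if err != nil {
			return nil, err
		}
		// Only start elements carry attributes.
		if start, ok := token.(xml.StartElement); ok {
			for _, attr := range start.Attr {
				found = append(found,
					start.Name.Local+"@"+attr.Name.Local+"="+attr.Value)
			}
		}
	}
}

func main() {
	doc := `<person><email type="personal">jan@newmarch.name</email>` +
		`<email type="work">j.newmarch@boxhill.edu.au</email></person>`
	attrs, err := attrsOf(doc)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	for _, a := range attrs {
		fmt.Println(a)
	}
}
```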
A program to print out the tree structure of an XML document is
/* Parse XML
 */
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Println("Usage: ", os.Args[0], "file")
		os.Exit(1)
	}
	file := os.Args[1]
	bytes, err := ioutil.ReadFile(file)
	checkError(err)
	r := strings.NewReader(string(bytes))

	parser := xml.NewDecoder(r)
	depth := 0
	for {
		// Token returns an error at end of input or on a syntax error
		token, err := parser.Token()
		if err != nil {
			break
		}
		switch t := token.(type) {
		case xml.StartElement:
			printElmt(t.Name.Local, depth)
			depth++
		case xml.EndElement:
			depth--
			printElmt(t.Name.Local, depth)
		case xml.CharData:
			printElmt("\""+string(t)+"\"", depth)
		case xml.Comment:
			printElmt("Comment", depth)
		case xml.ProcInst:
			printElmt("ProcInst", depth)
		case xml.Directive:
			printElmt("Directive", depth)
		default:
			fmt.Println("Unknown")
		}
	}
}

// printElmt prints s indented according to its depth in the tree
func printElmt(s string, depth int) {
	for n := 0; n < depth; n++ {
		fmt.Print(" ")
	}
	fmt.Println(s)
}

func checkError(err error) {
	if err != nil {
		fmt.Println("Fatal error ", err.Error())
		os.Exit(1)
	}
}
Note that the parser includes all CharData, including the whitespace between tags.
If we run this program against the person data structure given earlier, it produces
person
"
"
name
"
"
family
" Newmarch "
family
"
"
personal
" Jan "
personal
"
"
name
"
"
email
"
jan@newmarch.name
"
email
"
"
email
"
j.newmarch@boxhill.edu.au
"
email
"
"
person
"
"
Note that as no DTD or other XML specification has been used, the tokenizer correctly prints out all the white space (a DTD may specify that the whitespace can be ignored, but without it that assumption cannot be made).
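If the application itself knows that whitespace-only text is insignificant, it can filter it out. The following is a minimal sketch under that assumption; the helper name textNodes is hypothetical, and trimming with strings.TrimSpace is a choice made for this example, not something the parser does.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// textNodes (a hypothetical helper) returns the character data in doc with
// surrounding whitespace trimmed, skipping nodes that are whitespace only.
func textNodes(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var texts []string
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			return texts, nil
		}
		if err != nil {
			return nil, err
		}
		if data, ok := token.(xml.CharData); ok {
			// Converting to string copies the bytes, so the result stays
			// valid after the next call to Token.
			text := strings.TrimSpace(string(data))
			if text != "" {
				texts = append(texts, text)
			}
		}
	}
}

func main() {
	doc := "<name>\n <family> Newmarch </family>\n <personal> Jan </personal>\n</name>"
	texts, err := textNodes(doc)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Printf("%q\n", texts)
}
```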
There is a potential trap in using this parser. The bytes in a token refer to the parser's internal buffer and remain valid only until the next call to Token, so once you see a token you need to copy its value if you want to refer to it later. Go has methods such as func (c CharData) Copy() CharData to make a copy of data.
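A short sketch of the copying step is given below. It keeps every CharData token for later use by calling Copy before asking for the next token; the helper name collectText is an assumption for this example. The package also provides xml.CopyToken, which copies a token of any type.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// collectText (a hypothetical helper) keeps a private copy of each CharData
// token so that the values remain valid after later calls to Token.
func collectText(doc string) ([]xml.CharData, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var kept []xml.CharData
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			return kept, nil
		}
		if err != nil {
			return nil, err
		}
		if data, ok := token.(xml.CharData); ok {
			// Without Copy, data could be overwritten by the next token.
			kept = append(kept, data.Copy())
		}
	}
}

func main() {
	kept, err := collectText("<a><b>one</b><c>two</c></a>")
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	for _, data := range kept {
		fmt.Printf("%q\n", string(data))
	}
}
```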
Unmarshalling XML
Go provides a function Unmarshal and a method func (*Decoder) Decode to unmarshal XML into Go data structures. The unmarshalling is not perfect: Go and XML are different languages.
We consider a simple example before looking at the details. We take the XML document given earlier of
<person>
<name>
<family> Newmarch </family>
<personal> Jan </personal>
</name>
<email type="personal">
jan@newmarch.name
</email>
<email type="work">
j.newmarch@boxhill.edu.au
</email>
</person>
We would like to map this onto the Go structures
type Person struct {
	Name  Name
	Email []Email
}
type Name struct {
	Family   string
	Personal string
}
type Email struct {
	Type    string
	Address string
}
This requires several comments:
- Unmarshalling uses the Go reflection package. This requires that all fields be public, i.e. start with a capital letter. Earlier versions of Go used case-insensitive matching to match fields such as the XML string "name" to the field Name. Now, though, case-sensitive matching is used. To perform a match, the structure fields must be tagged to show the XML string that will be matched against. This changes Person to
type Person struct {
	Name  Name    `xml:"name"`
	Email []Email `xml:"email"`
}
- While tagging of fields can attach XML strings to fields, it can't do so with the names of the structures. An additional field is required, with field name "XMLName", conventionally of type xml.Name. This only affects the top-level struct, Person
type Person struct {
	XMLName xml.Name `xml:"person"`
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"`
}
- Repeated tags in the XML map to a slice in Go
- Attributes within tags will match to fields in a structure only if the Go field has the tag ",attr". This occurs with the field Type of Email, where matching the attribute "type" of the "email" tag requires xml:"type,attr"
- If an XML tag has no attributes and only has character data, then it matches a string field by the same name (case-sensitive, though). So the tag xml:"family" with character data "Newmarch" maps to the string field Family
- But if the tag has attributes, then it must map to a structure. Go assigns the character data to the field with tag ,chardata. This occurs with the "email" data and the field Address with tag ,chardata
A program to unmarshal the document above is
/* Unmarshal
 */
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

type Person struct {
	XMLName xml.Name `xml:"person"`
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"`
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Email struct {
	Type    string `xml:"type,attr"`
	Address string `xml:",chardata"`
}

func main() {
	str := `<?xml version="1.0" encoding="utf-8"?>
<person>
<name>
<family> Newmarch </family>
<personal> Jan </personal>
</name>
<email type="personal">
jan@newmarch.name
</email>
<email type="work">
j.newmarch@boxhill.edu.au
</email>
</person>`

	var person Person
	err := xml.Unmarshal([]byte(str), &person)
	checkError(err)

	// now use the person structure e.g.
	fmt.Println("Family name: \"" + person.Name.Family + "\"")
	fmt.Println("Second email address: \"" + person.Email[1].Address + "\"")
}

func checkError(err error) {
	if err != nil {
		fmt.Println("Fatal error ", err.Error())
		os.Exit(1)
	}
}
(Note that the whitespace around each value is preserved, so the spaces in the output are correct.) The strict rules are given in the package documentation.
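When the XML arrives on a stream such as a file or network connection rather than in a byte slice, the Decode method of a Decoder can be used in place of Unmarshal. The sketch below is one way this might look; the Contact type and the decodeContact helper are hypothetical names for this example, and trimming the addresses is an application choice rather than something the package does.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Email maps the "type" attribute and the text content of an <email> tag.
type Email struct {
	Type    string `xml:"type,attr"`
	Address string `xml:",chardata"`
}

// Contact (a hypothetical type) picks out only the emails of a <person>;
// elements with no matching field, such as <name>, are ignored.
type Contact struct {
	XMLName xml.Name `xml:"person"`
	Email   []Email  `xml:"email"`
}

// decodeContact reads one <person> element from doc through a Decoder and
// trims the whitespace that surrounds each address.
func decodeContact(doc string) (Contact, error) {
	var c Contact
	decoder := xml.NewDecoder(strings.NewReader(doc))
	if err := decoder.Decode(&c); err != nil {
		return c, err
	}
	for i := range c.Email {
		c.Email[i].Address = strings.TrimSpace(c.Email[i].Address)
	}
	return c, nil
}

func main() {
	doc := `<person>
<name><family> Newmarch </family></name>
<email type="personal">
jan@newmarch.name
</email>
<email type="work">
j.newmarch@boxhill.edu.au
</email>
</person>`
	c, err := decodeContact(doc)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	for _, e := range c.Email {
		fmt.Println(e.Type + ": " + e.Address)
	}
}
```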
Marshalling XML
Go 1 also has support for marshalling data structures into an XML document. The function is
func Marshal(v interface{}) ([]byte, error)
It can be used as a check on the previous program, by marshalling the person structure after it has been unmarshalled.
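As an illustration, the sketch below marshals a value built from the same structures as the unmarshalling example. The helper name marshalPerson and the sample values are assumptions for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName xml.Name `xml:"person"`
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"`
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Email struct {
	Type    string `xml:"type,attr"`
	Address string `xml:",chardata"`
}

// marshalPerson (a hypothetical helper) encodes p as XML text.
func marshalPerson(p Person) (string, error) {
	out, err := xml.Marshal(p)
	return string(out), err
}

func main() {
	p := Person{
		Name:  Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{{Type: "personal", Address: "jan@newmarch.name"}},
	}
	text, err := marshalPerson(p)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Println(text)
}
```

The same field tags drive both directions: the XMLName tag supplies the name of the root element, ",attr" turns Type into an attribute, and ",chardata" writes Address as the text of the email element. The package also offers MarshalIndent for more readable output.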
XHTML
HTML does not conform to XML syntax. It has unterminated tags such as '<br>'. XHTML is a cleanup of HTML to make it compliant with XML. Documents in XHTML can be managed using the techniques above for XML.
HTML
There is some support in the XML package to handle HTML documents even though they are not XML-compliant. The XML parser discussed earlier can handle many HTML documents if it is modified by
parser := xml.NewDecoder(r)
parser.Strict = false
parser.AutoClose = xml.HTMLAutoClose
parser.Entity = xml.HTMLEntity
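To see these settings at work, the following sketch lists the element names in a small HTML fragment that contains an unterminated br tag and an HTML entity. The helper name elementNames and the sample fragment are assumptions for this example; real-world HTML may still contain constructs this approach cannot handle.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// elementNames (a hypothetical helper) returns the names of the start
// elements in an HTML fragment, using the relaxed decoder settings.
func elementNames(html string) ([]string, error) {
	parser := xml.NewDecoder(strings.NewReader(html))
	parser.Strict = false                // tolerate some non-XML syntax
	parser.AutoClose = xml.HTMLAutoClose // tags such as <br> close themselves
	parser.Entity = xml.HTMLEntity       // accept entities such as &eacute;

	var names []string
	for {
		token, err := parser.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return names, err
		}
		if start, ok := token.(xml.StartElement); ok {
			names = append(names, start.Name.Local)
		}
	}
}

func main() {
	names, err := elementNames("<p>Fish &amp; chips<br>caf&eacute;</p>")
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Println(names)
}
```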
Conclusion
Go has basic support for dealing with XML strings. It does not as yet have mechanisms for dealing with XML specification languages such as XML Schema or RELAX NG.
Copyright Jan Newmarch, jan@newmarch.name