Error message

org.dom4j.DocumentException: Error on line 1 of document : An invalid XML character (Unicode: 0x2) was found in the element content of the document. Nested exception: An invalid XML character (Unicode: 0x2) was found in the element content of the document.

Cause

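dom4j is a library for parsing XML documents. The HTML document being parsed contains characters that are hard to spot by eye but are illegal in XML, so parsing fails. Here the offending character is the control character 0x2: below 0x20, XML 1.0 permits only tab (0x9), line feed (0xA), and carriage return (0xD).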

Solution

Strip the illegal characters from the string before handing it to the parser:

string = string.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
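The one-liner above can be wrapped in a small helper so the pattern is compiled once and reused. This is only a minimal sketch: the class name XmlSanitizer and the method name stripInvalidXmlChars are hypothetical, and it removes exactly the same character range as the replaceAll call above, nothing more. It also assumes the input is non-null.

```java
import java.util.regex.Pattern;

// Hypothetical helper (name not from the original post): removes the C0
// control characters that XML 1.0 forbids, keeping tab, LF, and CR.
final class XmlSanitizer {

    // Same range as the replaceAll call above: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
    private static final Pattern INVALID_XML_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    private XmlSanitizer() {
        // Utility class, not meant to be instantiated.
    }

    // Returns a copy of text with the forbidden control characters removed.
    // Assumes text is non-null.
    static String stripInvalidXmlChars(String text) {
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }
}
```

With dom4j on the classpath, the cleaned string could then be parsed with something like DocumentHelper.parseText(XmlSanitizer.stripInvalidXmlChars(string)). Note that this range does not cover every character XML 1.0 rejects (for example 0xFFFE and 0xFFFF), so widen the pattern if such characters show up in your input.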