void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException

The parse method takes the document to be parsed and related metadata as input, and outputs the results as XHTML SAX events and extra metadata. The main criteria that led to this design were:

Streamed parsing: The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.

Structured content: A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.

Input metadata: A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.

Output metadata: A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, like the name of the author, that may be useful to client applications.

These criteria are reflected in the arguments of the parse method.

The final argument to the parse method is used to pass document metadata both in and out of the parser. Document metadata is expressed as a Metadata object.

The following are some of the more interesting metadata properties:

Metadata.RESOURCE_NAME_KEY: The name of the file or resource that contains the document. A client application can set this property to allow the parser to use file name heuristics to determine the format of the document. The parser implementation may set this property if the file format contains the canonical name of the file (for example, the Gzip format has a slot for the file name).

Metadata.CONTENT_TYPE: The declared content type of the document. A client application can set this property based on, for example, an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document. The parser implementation sets this property to the content type according to which the document was parsed.

Metadata.TITLE: The title of the document. The parser implementation sets this property if the document format contains an explicit title field.

Metadata.AUTHOR: The name of the author of the document.
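The shape of this contract can be illustrated with a small, JDK-only sketch. It is not Tika code: the class StreamedParseSketch, its toy parse method, the helper headingsOf, and the use of a plain Map in place of a Metadata object are all assumptions made for illustration. The sketch reads one line at a time, emits each line as a SAX event, uses the resource name from the input metadata as a file name heuristic, and writes the content type it actually used back into the same map.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; none of these names come from Tika.
class StreamedParseSketch {
    // Assumed keys for this sketch; a plain Map stands in for a Metadata object.
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Toy parser with the same argument shape: stream in, SAX events and metadata out.
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Input metadata: a file name heuristic decides how lines are interpreted.
        boolean markdown = metadata.getOrDefault(RESOURCE_NAME, "").endsWith(".md");
        // Output metadata: report the type the document was actually parsed as.
        metadata.put(CONTENT_TYPE, markdown ? "text/markdown" : "text/plain");

        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "body", "body", none);
        String line;
        while ((line = reader.readLine()) != null) { // only one line is held at a time
            // Structured content: headings become <h1>, everything else <p>.
            boolean heading = markdown && line.startsWith("# ");
            String tag = heading ? "h1" : "p";
            char[] text = (heading ? line.substring(2) : line).toCharArray();
            handler.startElement(XHTML, tag, tag, none);
            handler.characters(text, 0, text.length);
            handler.endElement(XHTML, tag, tag);
        }
        handler.endElement(XHTML, "body", "body");
        handler.endDocument();
    }

    // Client side: collect heading text from the event stream, ignoring the rest.
    static List<String> headingsOf(String text, Map<String, String> metadata) {
        final List<String> headings = new ArrayList<>();
        DefaultHandler collector = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside an <h1>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("h1".equals(local)) current = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) current.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (current != null) {
                    headings.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return headings;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(RESOURCE_NAME, "notes.md"); // set by the client before parsing
        System.out.println(headingsOf("# Intro\nbody text\n# Usage\n", metadata));
        System.out.println(metadata.get(CONTENT_TYPE)); // set by the toy parser
    }
}
```

Because the handler sees events as they are produced, neither side needs the whole document in memory, and the single metadata argument carries hints in and results out. A real Tika parser would use the library's own Metadata class and XHTML helpers rather than these stand-ins.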