OOXMLParser Failed Extraction #45
Hi: https://gitee.com/mrlijing/extractous [patch.crates-io] |
Hello everyone!
I have started using Extractous to extract text and metadata from Microsoft Office files (doc, docx, xls, xlsx, pptx), but I have run into several issues with the OOXMLParser, mostly related to missing classes in tika-native.
The errors are numerous and varied, but they all originate in the OOXMLParser. I cannot provide the original documents that trigger the errors, so I tried to create some that reproduce the different issues.
Note: To obtain the stack trace of the errors, I compressed the documents into a .zip file and processed them with Extractous. The stack trace of the issue is then present within the metadata.
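The workflow in the note above can be sketched roughly as follows. This is only a minimal sketch: it assumes the metadata is available as a plain dictionary and that the embedded document's stack trace sits under a key whose name contains "exception" (the exact key name is an assumption, and `find_exception_entries` is a hypothetical helper, not part of Extractous).

```python
from typing import Dict, List, Union

MetadataValue = Union[str, List[str]]


def find_exception_entries(metadata: Dict[str, MetadataValue]) -> Dict[str, List[str]]:
    """Return the metadata entries whose key mentions an exception.

    Assumption: failures of embedded documents are reported under a key
    containing the word "exception" (case-insensitive). Values are
    normalised to lists of strings so callers can iterate uniformly.
    """
    found: Dict[str, List[str]] = {}
    for key, value in metadata.items():
        if "exception" in key.lower():
            found[key] = [value] if isinstance(value, str) else list(value)
    return found
```

Here `metadata` would be whatever dictionary Extractous returns alongside the extracted text for the .zip archive; printing each value of the result should show the stack traces of the documents that failed inside the archive.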
I cannot provide the documents for these, but I resolved the errors locally by adding the missing classes to reachability-metadata.json:
https://github.com/yobix-ai/extractous/blob/main/extractous-core/tika-native/src/main/resources/META-INF/ai.yobix/tika-2.9.2-linux/reachability-metadata.json#L5317
issue2.xlsx
issue2_v2.docx (same issue)
org.graalvm.nativeimage.builder/com.oracle.svm.core.JavaMemoryUtil.copyObjectArrayForwardWithStoreCheck(JavaMemoryUtil.java:495)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.graal.jdk.SubstrateArraycopySnippets.doArraycopy(SubstrateArraycopySnippets.java:113)
at [email protected]/java.util.Arrays.copyOf(Arrays.java:3516)
at [email protected]/java.util.ArrayList.toArray(ArrayList.java:401)
at org.apache.xmlbeans.impl.values.XmlObjectBase.getXmlObjectArray(XmlObjectBase.java:3203)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTDxfsImpl.getDxfArray(CTDxfsImpl.java:54)
at org.apache.poi.xssf.model.StylesTable.readFrom(StylesTable.java:268)
at org.apache.poi.xssf.model.StylesTable.<init>(StylesTable.java:159)
at org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:179)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:144)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 13 more
I cannot provide the file, but the issue can be fixed by adding the missing class to reachability-metadata.json.
Exception in thread "main": org.graalvm.nativeimage.MissingReflectionRegistrationError
org.graalvm.nativeimage.MissingReflectionRegistrationError: The program tried to reflectively instantiate the array class
org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[]
without it being registered for runtime reflection. Add org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraphProperties[] to the reflection metadata to solve this problem. Note: Add "unsafeAllocated" to the array class registration to enable runtime instantiation. See https://www.graalvm.org/latest/reference-manual/native-image/metadata/#reflection for help.
at org.graalvm.nativeimage.builder/com.oracle.svm.core.reflect.MissingReflectionRegistrationUtils.errorForArray(MissingReflectionRegistrationUtils.java:121)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.graal.snippets.SubstrateAllocationSnippets.arrayHubErrorStub(SubstrateAllocationSnippets.java:364)
at org.apache.xmlbeans.impl.values.XmlObjectBase._typedArray(XmlObjectBase.java:442)
at org.apache.xmlbeans.impl.values.XmlObjectBase.selectPath(XmlObjectBase.java:482)
at org.apache.xmlbeans.impl.values.XmlObjectBase.selectPath(XmlObjectBase.java:448)
at org.apache.poi.xssf.model.ParagraphPropertyFetcher.fetch(ParagraphPropertyFetcher.java:57)
at org.apache.poi.xssf.usermodel.XSSFTextParagraph.fetchParagraphProperty(XSSFTextParagraph.java:860)
at org.apache.poi.xssf.usermodel.XSSFTextParagraph.isBullet(XSSFTextParagraph.java:728)
at org.apache.poi.xssf.usermodel.XSSFSimpleShape.getText(XSSFSimpleShape.java:202)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processShapes(XSSFExcelExtractorDecorator.java:272)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:189)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:192)
at ai.yobix.TikaNativeMain.parseFileToString(TikaNativeMain.java:87)
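The MissingReflectionRegistrationError message above names the array class that has to be registered. A minimal sketch of automating that fix follows. It assumes that reachability-metadata.json has a top-level "reflection" list whose entries carry a "type" field, and that `"unsafeAllocated": true` is the flag the message refers to. `missing_array_class` and `register_type` are hypothetical helper names.

```python
import re
from typing import Optional

# Matches the hint in the error text: "Add <class> to the reflection metadata".
_MISSING = re.compile(r"Add (\S+) to the reflection metadata")


def missing_array_class(message: str) -> Optional[str]:
    """Extract the class name the error message asks to register, if any."""
    match = _MISSING.search(message)
    return match.group(1) if match else None


def register_type(metadata: dict, type_name: str) -> bool:
    """Ensure type_name is in the "reflection" list with unsafeAllocated set.

    Assumes the layout {"reflection": [{"type": ...}, ...]}.
    Returns True if the metadata was changed, False if already registered.
    """
    entries = metadata.setdefault("reflection", [])
    for entry in entries:
        if entry.get("type") == type_name:
            if entry.get("unsafeAllocated"):
                return False
            entry["unsafeAllocated"] = True
            return True
    entries.append({"type": type_name, "unsafeAllocated": True})
    return True
```

In practice the dictionary would be loaded from reachability-metadata.json with `json.load`, updated, written back with `json.dump`, and the native image rebuilt so that the new registration takes effect.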
Caused by: org.apache.poi.ooxml.POIXMLException: java.lang.ClassCastException: Stack trace is imprecise, the top frames are missing and/or have wrong line numbers. To get precise stack traces, build the image with option -H:-ReduceImplicitExceptionStackTraceInformation
at org.apache.poi.xslf.usermodel.XMLSlideShow.<init>(XMLSlideShow.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.tryXSLF(OOXMLExtractorFactory.java:324)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:199)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 13 more
Caused by: java.lang.ClassCastException: Stack trace is imprecise, the top frames are missing and/or have wrong line numbers. To get precise stack traces, build the image with option -H:-ReduceImplicitExceptionStackTraceInformation
at org.apache.poi.xslf.usermodel.XSLFDiagramDrawing.readPackagePart(XSLFDiagramDrawing.java:47)
at org.apache.poi.xslf.usermodel.XSLFDiagramDrawing.<init>(XSLFDiagramDrawing.java:43)
at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:662)
at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:679)
at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
at org.apache.poi.xslf.usermodel.XMLSlideShow.<init>(XMLSlideShow.java:125)
... 17 more
at org.apache.poi.xssf.model.CommentsTable.readFrom(CommentsTable.java:86)
at org.apache.poi.xssf.model.CommentsTable.<init>(CommentsTable.java:80)
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.parseComments(XSSFReader.java:436)
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetComments(XSSFReader.java:425)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:161)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:143)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:127)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 13 more
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:469)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:297)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:230)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:146)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:247)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:272)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:206)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
xceptionStackTraceInformation
at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:63)
at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:40)
at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:221)
Thank you for your work on this repository. I hope you can fix these issues and release a new version of Extractous as soon as possible.