High Volume PDF Text Extraction using Python Open-Source Tools — Harald Lieder



[EuroPython 2023 — South Hall 2A on 2023-07-20]

All major companies hold huge amounts of documents, mostly PDFs, that contain important, even critically important, information that no longer exists anywhere else in their data stores.

Reports, once generated for shareholders and legal or financial authorities, may still be useful for developing long-term forecasts or triggering management decisions.

By definition, documents are intended for human perception, and as such contain unstructured data from an information technology perspective.

Therefore, tools that extract PDF content (mostly, but not only, text) from millions of pages have become important vehicles for recreating structured information.

This presentation covers the “need for speed” in extraction in this Big Data scenario and the need for integration with OCR capabilities, and presents an open-source toolset that combines top-of-the-class performance with maximum extraction detail.
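
As a rough illustration of page-wise extraction with an OCR fallback check, here is a minimal sketch. The abstract does not name the toolset, so the use of PyMuPDF (imported as `fitz`) is an assumption made for this example only. The helper `needs_ocr` and its `min_chars` threshold are hypothetical and are not taken from the talk.

```python
from typing import Iterator, Tuple


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page whose text layer yields almost no
    characters is probably a scanned image and a candidate for OCR."""
    return len(text.strip()) < min_chars


def extract_pages(pdf_path: str) -> Iterator[Tuple[int, str, bool]]:
    """Yield (page_number, text, ocr_suggested) for each page of one PDF.

    Assumes PyMuPDF is installed; the abstract itself names no library.
    """
    import fitz  # PyMuPDF, imported lazily so needs_ocr works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()  # plain text from the page's text layer
            yield page.number, text, needs_ocr(text)
```

Pages flagged by `needs_ocr` could then be routed to a separate OCR step, while pages with a usable text layer skip it, which keeps the slow path limited to the pages that may need it.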

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License


