Thursday, February 9, 2023

Read a Specific Page from a PDF File in Python


Document processing is one of the most common use cases for the Python programming language. Python can process many kinds of files, such as database files, multimedia files and encrypted files, to name a few. This article explains how to read a specific page from a PDF (Portable Document Format) file in Python.

Method 1: Using the PyMuPDF library to read a page in Python

The PyMuPDF library, together with PIL (Python Imaging Library, provided by the Pillow package), is used for PDF processing in this article. To install the PyMuPDF library, run the following command in the command processor of the operating system:

pip install pymupdf

Note: The PyMuPDF library is imported under the name fitz, using the following statement.

import fitz

Reading a page from a PDF file requires loading the file and then displaying the contents of only one of its pages. In this method that single page is rendered as an image, so the page from the PDF file is read and displayed as a picture.

The following example demonstrates the above process:

Python3

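import fitz  # PyMuPDF
from PIL import Image

# Path to the input PDF
input_file = r"test.pdf"

# Open the PDF and keep its document handle
file_handle = fitz.open(input_file)

# Load the first page (pages are indexed from 0)
page = file_handle[0]

# Render the page to a pixel map
page_img = page.get_pixmap()

# Save the pixel map as a PNG image
page_img.save('PDF_page.png')

# Open the PNG with PIL and display it
img = Image.open('PDF_page.png')
img.show()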

Output:

The first page of test.pdf opens as an image in the system's default image viewer.

Explanation:

First, the PDF file is opened and its file handle is stored. Then the first page of the PDF (at index 0) is loaded using list-style indexing. The page's pixel map (pixel array) is obtained using the get_pixmap function and stored in a variable. This pixel map is then saved as a PNG image file, which is opened using the open function in the Image module of PIL. Finally, the image is displayed using the show function.

Note: The first open function opens the PDF file, and the second one opens the PNG image file. The two functions belong to different libraries and serve different purposes.
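The example above always reads index 0 and writes a temporary PNG. As a rough sketch of how the same idea could be generalised, the block below takes a 1-based page number, checks it against the page count and builds the PIL image directly from the pixel map. The helper names to_page_index and render_page and the zoom parameter are illustrative assumptions, not part of PyMuPDF.

```python
def to_page_index(page_number, page_count):
    """Convert a 1-based page number into the 0-based index PyMuPDF expects."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def render_page(pdf_path, page_number, zoom=2.0):
    """Render one page of a PDF to a PIL image without a temporary PNG.

    Hypothetical helper: only the fitz and PIL calls inside it are library API.
    """
    # Imported here so to_page_index() stays usable without the libraries.
    import fitz  # PyMuPDF
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        page = doc[to_page_index(page_number, doc.page_count)]
        # Matrix(zoom, zoom) scales the default rendering resolution.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        # get_pixmap() yields RGB samples by default (no alpha channel),
        # and frombytes() copies them, so the image outlives the document.
        return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
```

With these helpers, render_page("test.pdf", 1).show() would display the first page, and an out-of-range page number fails early with a ValueError.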

Method 2: Reading a specific page from a PDF using PyPDF2

For the second example, the PyPDF2 library is used. It can be installed by running the following command:

pip install PyPDF2

The same goal can be achieved with the PyPDF2 library. The library processes PDF files and supports operations such as reading, writing or creating a PDF file. For the task at hand, the extractText function is used to obtain the text from a page of the PDF file and display it. The code for this is as follows:

Python3

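# Uses the PyPDF2 1.x names; PyPDF2 3.x replaced them with
# PdfReader, reader.pages[n] and extract_text()
import PyPDF2

# Path to the input PDF
input_file = r"test.pdf"

# Index of the page to read (0-based, so 4 is the fifth page)
page = 4

# Open the PDF in binary read mode
pdfFileObj = open(input_file, 'rb')

# Create a PDF reader object from the file object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Fetch the requested page
pageObj = pdfReader.getPage(page)

# Extract the text of that page
data = pageObj.extractText()

# Close the file object
pdfFileObj.close()

# Display the extracted text
print(data)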

Output:

He started this Journey with just one 
thought- every geek should have 
access to a never ending range of 
academic resources and with a lot 
of hardwork and dedication, 
GeeksforGeeks was born.
Through this platform, he has        
successfully enriched the minds of 
students with knowledge which has 
led to a boost in their careers. But 
most importantly, GeeksforGeeks 
will always help students stay in 
touch with their Geeky side!
EXPERT ADVICE
CEO and Founder of 
GeeksforGeeks
                  I understand that many 
students who come to us are 
either fans of the sciences or 
have been pushed into this 
feild by their parents.
And I just want you to 
know that no matter 
where life takes you, we 
at GeeksforGeeks hope 
to have made this 
journey easier for  
you.Mr. Sandeep Jain
3

Explanation:

First, the path to the input PDF and the page index are defined in separate variables. Then the PDF file is opened and its file object is stored in a variable. This variable is passed as an argument to PdfFileReader, which creates a PDF reader object from the file object. Next, the page at the index held in the page variable is fetched with getPage (indexes start at 0, so 4 selects the fifth page) and stored in a variable. The text is then extracted from that page, and the file object is closed. Finally, the extracted text is displayed.
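Recent PyPDF2 releases renamed the calls used above: PdfFileReader became PdfReader, getPage(n) became reader.pages[n], and extractText() became extract_text(). The following is a minimal sketch of the same steps with those names, assuming PyPDF2 2.0 or later is installed. The helper names text_of_page and read_page_text are hypothetical and chosen only for illustration.

```python
def text_of_page(reader, page_index):
    """Return the text of one page from a PyPDF2-style reader object."""
    page_count = len(reader.pages)
    # Reject out-of-range (and negative) indexes with a clear message.
    if not 0 <= page_index < page_count:
        raise IndexError(
            f"page index {page_index} is outside 0..{page_count - 1}"
        )
    return reader.pages[page_index].extract_text()


def read_page_text(pdf_path, page_index):
    """Open a PDF and return the text of the page at a 0-based index."""
    # Imported here so text_of_page() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.0 or later

    # The with block closes the file even if extraction raises an error.
    with open(pdf_path, "rb") as pdf_file:
        return text_of_page(PdfReader(pdf_file), page_index)
```

Under these assumptions, print(read_page_text("test.pdf", 4)) would print the text of the fifth page, matching the example above.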
