Working With Text
Patagames PDF SDK provides APIs to search, select, extract, and retrieve text in PDF documents. The text content of each page is stored in a PdfText object, which is available through the page's Text property. PdfText can be used to retrieve information about the text on a page, such as a single character, a single word, web links, or the text within a specified character range or rectangle. It can also be used to search the text of a page.
    // Load the PDF document.
    using (var doc = PdfDocument.Load(@"sample.pdf"))
    {
        PdfPage page = doc.Pages[0];
        // Number of characters on the page.
        int count = page.Text.CountChars;
        if (count > 0)
        {
            // Extract all characters, starting from the first one.
            int startIndex = 0;
            string text = page.Text.GetText(startIndex, count);
            Console.WriteLine(text);
        }
    }
Text can also be extracted from a rectangular area of the page. The rectangle is specified in the user space coordinate system. Please refer to PDF Page for details.
    // Load the PDF document.
    using (var doc = PdfDocument.Load(@"sample.pdf"))
    {
        PdfPage page = doc.Pages[0];
        // A rectangle in the user space coordinate system: left, top, right and bottom.
        FS_RECTF rect = new FS_RECTF(10, 500, 220, 100);
        string text = page.Text.GetBoundedText(rect);
        Console.WriteLine(text);
    }
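For readers used to screen coordinates, the following sketch shows how a rectangle measured from the top-left corner of a page maps to the left, top, right, bottom order used above. It assumes the default PDF user space (origin at the bottom-left corner, y axis pointing up, units in points), an unrotated page, and a page origin at (0, 0). `UserSpaceRect` and `FromTopLeft` are hypothetical helpers written for this illustration and are not part of the SDK.

```csharp
using System;

// Hypothetical helper for illustration; not part of Patagames PDF SDK.
public static class UserSpaceRect
{
    // Converts a rectangle given as (x, y, width, height) with a top-left
    // origin and a downward y axis into (left, top, right, bottom) values
    // for a bottom-left origin with an upward y axis.
    public static (float Left, float Top, float Right, float Bottom) FromTopLeft(
        float x, float y, float width, float height, float pageHeight)
    {
        float left = x;
        float right = x + width;
        // Flip the vertical axis: distances from the top edge become
        // distances from the bottom edge.
        float top = pageHeight - y;
        float bottom = pageHeight - (y + height);
        return (left, top, right, bottom);
    }
}
```

For a page 792 points high, `UserSpaceRect.FromTopLeft(10, 292, 210, 400, 792)` yields (10, 500, 220, 100), the same values passed to FS_RECTF in the sample above.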
You can get the text and its bounds as shown in the following code sample.
    // Load the PDF document.
    using (var doc = PdfDocument.Load(@"sample.pdf"))
    {
        PdfPage page = doc.Pages[0];
        int count = page.Text.CountChars;
        if (count > 0)
        {
            int startIndex = 0;
            PdfTextInfo textInfo = page.Text.GetTextInfo(startIndex, count);
            string text = textInfo.Text;
            foreach (FS_RECTF rect in textInfo.Rects)
            {
                //...
            }
        }
    }
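One possible use of the returned rectangles is to merge them into a single bounding box, for example to highlight a whole passage. The sketch below shows that calculation. To stay self-contained it uses a hypothetical `Box` struct as a stand-in for FS_RECTF, and it assumes user space coordinates, where top is greater than bottom.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a rectangle with left, top, right and bottom edges.
public readonly struct Box
{
    public readonly float Left, Top, Right, Bottom;

    public Box(float left, float top, float right, float bottom)
    {
        Left = left; Top = top; Right = right; Bottom = bottom;
    }
}

public static class BoxMath
{
    // Returns the smallest box that contains all input boxes,
    // or null when the input is empty.
    public static Box? Union(IEnumerable<Box> boxes)
    {
        Box? result = null;
        foreach (Box b in boxes)
        {
            if (!result.HasValue)
            {
                result = b;
                continue;
            }
            Box r = result.Value;
            // In user space the y axis points up, so the top edge is the
            // maximum y value and the bottom edge is the minimum.
            result = new Box(
                Math.Min(r.Left, b.Left),
                Math.Max(r.Top, b.Top),
                Math.Max(r.Right, b.Right),
                Math.Min(r.Bottom, b.Bottom));
        }
        return result;
    }
}
```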
Patagames PDF SDK provides APIs to search text in a PDF document.
    // Load the PDF document.
    using (var doc = PdfDocument.Load(@"sample.pdf"))
    {
        PdfPage page = doc.Pages[0];
        int startIndex = 0;
        PdfFind found = page.Text.Find("text to find", FindFlags.MatchCase | FindFlags.MatchWholeWord, startIndex);
        if (found == null)
            return; // nothing found
        do
        {
            PdfTextInfo foundText = found.FoundText;
            string text = foundText.Text;
            FS_RECTF[] bounds = foundText.Rects.ToArray();
        } while (found.FindNext());
    }
Web links are links that are implicitly embedded in the text of PDF pages. PDF also has a type of annotation called "link" (PdfLinkAnnotation). The web link feature of PdfText does not deal with that kind of link. Instead, it is useful for automatically detecting links in the page contents. For example, text such as "https://patagames.com" will be detected, so an application can let the user click on those characters to activate the link, even if the PDF does not contain link annotations.
    using (var doc = PdfDocument.Load(@"sample.pdf"))
    {
        PdfPage page = doc.Pages[0];
        PdfWebLinkCollection urls = page.Text.WebLinks;

        // Get the web link at the specified point, if any.
        PdfWebLink webLink = urls.GetWebLinkAtPoint(100, 500);
        if (webLink == null)
        {
            // There is no web link at the specified point.
        }

        // Get the first web link in the collection.
        // This assumes that the page contains at least one web link.
        webLink = urls[0];

        // Get the web link's URL and its bounds.
        string url = webLink.Url;
        foreach (FS_RECTF rect in webLink.UrlInfo.Rects)
        {
            //...
        }
    }
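To illustrate what detecting links in page contents means, the sketch below finds URL-like substrings in plain text with a regular expression. It only illustrates the idea and is not the detection algorithm used by the SDK. The `UrlScanner` class and its deliberately simple pattern are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Patagames PDF SDK.
public static class UrlScanner
{
    // Simplified pattern: "http://" or "https://" followed by characters
    // that are not whitespace, quotes or angle brackets.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s""<>]+", RegexOptions.IgnoreCase);

    // Returns each detected URL with its start index and length in the text.
    public static List<(string Url, int Start, int Length)> Find(string text)
    {
        var result = new List<(string Url, int Start, int Length)>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            // Drop trailing punctuation, which usually belongs to the
            // surrounding sentence rather than to the URL.
            string url = m.Value.TrimEnd('.', ',', ';', ')');
            result.Add((url, m.Index, url.Length));
        }
        return result;
    }
}
```

For the input "Visit https://patagames.com. Thanks", `UrlScanner.Find` returns one entry with the URL "https://patagames.com", start index 6 and length 21.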