PDF parsing is a black art that most programmers avoid. “Madness lurks here,” they mumble quietly to themselves, choosing instead to push their PDFs through UIWebViews and commit other crimes against humanity.
It doesn’t have to be this way, however. Parsing, displaying, and searching PDFs natively and at a low level is actually surprisingly easy if you’re not afraid to get your hands a little dirty with the Core Graphics PDF functions. I’m going to show you how.
The first thing to know is that in order to do this, you need to use Core Graphics calls. So you need to include the Core Graphics framework in your project, and in any files you want to use the calls, you have to include the CoreGraphics.h header. It’s probably also worthwhile to review the Core Foundation memory management rules.
Once you’ve done this, it’s very straightforward to read your PDF files and display them in a custom view. Let’s take a look at how we do that.
To initialize a PDF document, you use the call CGPDFDocumentCreateWithURL, passing in the URL of the document you want to open. That function takes a CFURLRef, and since NSURL is toll-free bridged to CFURLRef, you can build the URL using plain old NSURL like so:
NSString *pathToPdfDoc = [[NSBundle mainBundle]
pathForResource:@"mypdf" ofType:@"pdf"];
NSURL *pdfUrl = [NSURL fileURLWithPath:pathToPdfDoc];
Then, to create the CGPDFDocumentRef, call CGPDFDocumentCreateWithURL:
CGPDFDocumentRef document = CGPDFDocumentCreateWithURL((CFURLRef)pdfUrl);
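CGPDFDocumentCreateWithURL returns NULL when the file can't be opened or isn't a PDF, and because it's a Create function, you own the result and have to release it with CGPDFDocumentRelease when you're done with it. Here's a minimal sketch of both points, assuming manual reference counting like the rest of the code in this article. The helper name CreateBundledPDFDocument is made up for illustration; it isn't part of Core Graphics.

```objc
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

// Hypothetical helper (not part of Core Graphics): opens a PDF from the
// main bundle. Returns NULL on failure; otherwise the caller owns the
// document and must balance it with CGPDFDocumentRelease.
static CGPDFDocumentRef CreateBundledPDFDocument(NSString *name) {
    NSString *path = [[NSBundle mainBundle] pathForResource:name ofType:@"pdf"];
    if (path == nil) {
        return NULL; // no such resource in the bundle
    }
    NSURL *url = [NSURL fileURLWithPath:path];
    // A plain cast works under manual reference counting; under ARC this
    // would need to be written as (__bridge CFURLRef)url.
    return CGPDFDocumentCreateWithURL((CFURLRef)url);
}
```

You'd call it as CreateBundledPDFDocument(@"mypdf"), check the result for NULL before using it, and call CGPDFDocumentRelease on it when your view goes away.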
So now you have a document. To display the content of the document, you have to get the content in the form of pages. PDFs are already formatted by pages, so all you need to do is get at that data. Fortunately Core Graphics has functions for that too.
To get the total count of the pages in the document, you use the call CGPDFDocumentGetNumberOfPages, which takes as its parameter the document you created above. So, for example:
size_t pageCount = CGPDFDocumentGetNumberOfPages(document);
Then, to get an individual page to display in your view, you use the function CGPDFDocumentGetPage, passing the document and the page number you want. Like so:
CGPDFPageRef page = CGPDFDocumentGetPage(document, currentPage);
Note that the currentPage parameter here is 1-based, not 0-based as is the usual case in programming. This means that the first page of the PDF document is, in fact, page 1, and not page 0.
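To make the 1-based numbering concrete, here's a small sketch that walks every page of a document and logs its size. The function name LogPageSizes is just something I made up for the example; the Core Graphics calls are the real ones.

```objc
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

// Hypothetical helper: logs the crop box size of every page in a document.
static void LogPageSizes(CGPDFDocumentRef document) {
    size_t pageCount = CGPDFDocumentGetNumberOfPages(document);
    // Page numbers run from 1 to pageCount inclusive, not 0 to pageCount - 1.
    for (size_t pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
        CGPDFPageRef page = CGPDFDocumentGetPage(document, pageNumber);
        if (page == NULL) {
            continue; // page could not be read
        }
        CGRect box = CGPDFPageGetBoxRect(page, kCGPDFCropBox);
        NSLog(@"Page %lu is %.0f x %.0f points",
              (unsigned long)pageNumber, box.size.width, box.size.height);
    }
}
```

The page comes from a Get function, so you don't own it and shouldn't release it; it stays valid as long as the document does.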
Once you have the page, you can display it in your custom view. The only complicated part here is that UIKit on iPhone puts the origin at the top left with y increasing downward, while the Core Graphics PDF system uses the desktop coordinate system, with the origin at the bottom left and y increasing upward, even on iPhone. (It’s yucky, I know.) The solution to this is to flip the context before drawing the page (this can be done in your drawRect method when you go to draw the content):
CGPDFPageRef page = CGPDFDocumentGetPage(document, currentPage);
CGContextRef ctx = UIGraphicsGetCurrentContext();
CGContextSaveGState(ctx);
CGContextTranslateCTM(ctx, 0.0, [self bounds].size.height);
CGContextScaleCTM(ctx, 1.0, -1.0);
CGContextConcatCTM(ctx, CGPDFPageGetDrawingTransform(page, kCGPDFCropBox, [self bounds], 0, true));
The key here is the call to CGContextScaleCTM. We get the current drawing context, and then we scale its coordinate system on its y axis by -1.0. This effectively flips it upside down along its horizontal (x) axis. The translate just before it moves the origin to the bottom of the view, so the flipped content lands back inside the bounds, and CGPDFPageGetDrawingTransform then fits the page's crop box to the view.
Finally, we draw the page into the context using the CGContextDrawPDFPage function:
CGContextDrawPDFPage(ctx, page);
CGContextRestoreGState(ctx);
So basically, a full-on drawRect method for a custom view that draws content from a PDF page looks something like this:
-(void)drawRect:(CGRect)inRect {
    if(document) {
        CGPDFPageRef page = CGPDFDocumentGetPage(document, currentPage);
        CGContextRef ctx = UIGraphicsGetCurrentContext();
        CGContextSaveGState(ctx);
        CGContextTranslateCTM(ctx, 0.0, [self bounds].size.height);
        CGContextScaleCTM(ctx, 1.0, -1.0);
        CGContextConcatCTM(ctx, CGPDFPageGetDrawingTransform(page, kCGPDFCropBox, [self bounds], 0, true));
        CGContextDrawPDFPage(ctx, page);
        CGContextRestoreGState(ctx);
    }
}
That's all there is to it!
One of the things that seems to be particularly scary to programmers is searching PDFs. I agree that it’s certainly not pleasant stuff to code, but it’s not hard either.
Now, I want to preface this by saying that I feel this code is a bit of a hack, but it definitely works, and seems to work quite well. Perhaps there’s a better way to do this, and if you know of one, please let me know. That said, however, here’s how I’ve done it.
The first thing to know is that the content of a PDF page is a stream of operands and operators. The notation is postfix: each operator comes after the data it acts on. So, for example, text in a PDF document is stored as strings of glyph codes followed by the operator “Tj”, in the case of a single string, or “TJ”, in the case of an array of strings (interleaved with numeric spacing adjustments). Knowing this, you can access the page content as a stream and create a scanner which will call callback functions you specify when these operators are encountered. Inside a callback, you pop the operator's operands off the scanner and use them to build your search corpus.
That probably sounds intimidating, but it’s not. You start out by creating a class that will be your “page searcher.” This will hold the state for your search engine. Here’s the listing for the interface for this class:
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

@interface PDFSearcher : NSObject {
    CGPDFOperatorTableRef table;
    NSMutableString *currentData;
}
@property (nonatomic, retain) NSMutableString *currentData;
-(id)init;
-(BOOL)page:(CGPDFPageRef)inPage containsString:(NSString *)inSearchString;
@end
Pretty straightforward stuff. We use the currentData member to store the text of the page being scanned. This is a member variable rather than a local variable because we're going to be using C functions to fill it in. Don't worry, that'll make sense in a moment.
The init method for the class actually creates the callback table. The two callbacks it registers need to be declared above it in the file, and since the table comes from a Create function, a matching dealloc (not shown) should release it with CGPDFOperatorTableRelease:
-(id)init {
    if ((self = [super init])) {
        // One table maps operator names to the C callbacks to run for them.
        table = CGPDFOperatorTableCreate();
        CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
        CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
    }
    return self;
}
The arrayCallback and the stringCallback functions are C functions that will be called by the scanner. They’re shown here:
void arrayCallback(CGPDFScannerRef inScanner, void *userInfo) {
    PDFSearcher *searcher = (PDFSearcher *)userInfo;
    CGPDFArrayRef array;
    // Pop the TJ operand; bail out if it isn't an array.
    if (!CGPDFScannerPopArray(inScanner, &array)) return;
    // TJ arrays mix strings with numeric spacing adjustments in no fixed
    // pattern, so visit every element and keep only the strings.
    size_t count = CGPDFArrayGetCount(array);
    for (size_t n = 0; n < count; n++) {
        CGPDFStringRef string;
        if (CGPDFArrayGetString(array, n, &string)) {
            NSString *data = (NSString *)CGPDFStringCopyTextString(string);
            if (data) {
                [searcher.currentData appendString:data];
                [data release];
            }
        }
    }
}

void stringCallback(CGPDFScannerRef inScanner, void *userInfo) {
    PDFSearcher *searcher = (PDFSearcher *)userInfo;
    CGPDFStringRef string;
    // Pop the Tj operand; ignore it if it isn't a string.
    if (CGPDFScannerPopString(inScanner, &string)) {
        NSString *data = (NSString *)CGPDFStringCopyTextString(string);
        if (data) {
            [searcher.currentData appendFormat:@" %@", data];
            [data release];
        }
    }
}
As you can see, these will be called when the operators fire. When they do, we pop the data off the scanner, and add it to the searcher's corpus. The userInfo pointer is actually pointing to our searcher object (based on the fact that we will pass it as the third parameter to CGPDFScannerCreate in the next code). So we can typecast it to a PDFSearcher and then access that currentData member (remember I said it would make sense later?).
The actual search method looks like this:
-(BOOL)page:(CGPDFPageRef)inPage containsString:(NSString *)inSearchString {
    // Start with an empty corpus for this page.
    [self setCurrentData:[NSMutableString string]];
    CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(inPage);
    CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, self);
    // Runs the callbacks, which fill in currentData.
    CGPDFScannerScan(scanner);
    CGPDFScannerRelease(scanner);
    CGPDFContentStreamRelease(contentStream);
    // Case-insensitive match: compare uppercased copies of both strings.
    return ([[currentData uppercaseString] rangeOfString:[inSearchString uppercaseString]].location != NSNotFound);
}
Basically, we create a stream from the page data, then use that and our callback table to create a scanner. We then scan the data. It's at this point our currentData member is being filled with the data from the PDF as strings. Finally, we just search that string for our search string.
Easy peasy.
Note: much of this code is only sight compiled. I pulled it from some code I had, but it wasn't a straight copy, so if you find an error, please let me know.
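To tie it together, here's a rough sketch of how you might call the searcher across a whole document and collect the pages that match. It assumes the PDFSearcher interface lives in a header called PDFSearcher.h and that you're using manual reference counting like the rest of this code; the function name PagesContainingString is made up for the example.

```objc
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>
#import "PDFSearcher.h" // assumed file name for the PDFSearcher interface

// Hypothetical helper: returns the 1-based page numbers (as NSNumbers)
// of every page whose extracted text contains searchString.
static NSArray *PagesContainingString(CGPDFDocumentRef document, NSString *searchString) {
    NSMutableArray *matches = [NSMutableArray array];
    if (document == NULL || [searchString length] == 0) {
        return matches; // nothing to search, or nothing to search for
    }
    // One searcher (and one operator table) is reused for every page.
    PDFSearcher *searcher = [[PDFSearcher alloc] init];
    size_t pageCount = CGPDFDocumentGetNumberOfPages(document);
    for (size_t n = 1; n <= pageCount; n++) {
        CGPDFPageRef page = CGPDFDocumentGetPage(document, n);
        if (page != NULL && [searcher page:page containsString:searchString]) {
            [matches addObject:[NSNumber numberWithUnsignedLong:(unsigned long)n]];
        }
    }
    [searcher release]; // manual reference counting; drop this under ARC
    return matches;
}
```

Since the search only tells you whether a page contains the string, this gives you a list of pages to jump to, not the position of the text on the page.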
Very slick, Jiva. I wouldn't have guessed that it would be this straightforward, as you note. Definitely storing this away for future reference, thanks!
I'm curious, there are a couple of other text operators in PDF, single quote and double quote, which are composites of Tj/TJ and other operators. Does the CGPDF system decompose these to Tj/TJ or do you have to add a function for them also? Smart use of the CGPDF* APIs! Particularly powerful on the iPhone. Now all you need are some heuristics to put together strings in the proper order based on their position in the page :)
With regard to the single/double quote question, I don't know if CGPDF converts those to Tj/TJ - my gut feeling is that it doesn't, so this could be a bug in my code. When I get some time, I'll toss together a unit test for that and see what I get.
Nice post. I'd casually read before that Core Graphics's internal drawing model is very similar to PDF but didn't realize that meant CG has such straightforward functions for handling PDFs. Good to know.
This was awesome! Thanks for posting this! been looking for some references for working with pdf's, but haven't had much success.
I was trying to do zooming with your example, and it works when I subclass UIScrollView. But the problem is that on zooming, the page itself is zoomed, so it becomes more and more blurred when I zoom out. I think after the zooming we should call setNeedsDisplay. It will call drawRect, which will then load the page for the new contentSize, and it should be OK. Even though it seems like it should work, it's not actually working. Can you help me? What changes should I make to the drawRect that you wrote so that it will work with UIScrollView?
The code that you have explained here is unmatched anywhere else. But what I am trying to do is something different from yours: I want to extract the entire text from a PDF, and your code just performs the search. Is it possible to read an entire PDF and store it in a string?
By experimenting with Jiva's example, what I found is that he is getting all the data in the page that he is parsing into the string "currentData". He has written functions for only two PDF operators since it is a tutorial. When it encounters the remaining operators, some unknown symbols will also show up in between the data stored in that string. So I think that if you write functions for the remaining needed PDF operators, you will get all the data in a page very clearly in that string.
I was trying to do zooming with your example, and it works when I tried it with a UIScrollView and the subclassed UIView object inside it. But the problem is that on zooming, the page itself is zoomed, so it becomes more and more blurred when I zoom out. I think after the zooming we should call setNeedsDisplay. It will call drawRect, which will then load the page for the bigger view (now zoomed out), and it should be OK. Even though it seems like it should work, it's not actually working. Can you help me? What changes should I make to the drawRect that you wrote so that it will work when I have zoomed out using UIScrollView?
I used the code, but when I add the operator table callback, the callback function does not work. I want to get the chapters or the catalog of the PDF. Is it possible to get that? Thank you!
Hey, what's up man?? Friends, i am beginner in iPhone programming, i need to open pdf using core graphics, i tried above code but failed to display. Help me please, i didn't use any webview, subview or scrollview in IB. Thnks.
I have used your code to search for a string in a PDF file, but I have some other tasks to do. I have to highlight the searched text in the PDF, like Acrobat Reader does. Is there any possibility? Please let me know. It's urgent. Thanks in advance.
This is a great start, and does manage to extract a LOT of the text. However, I also get a lot of garbage instead of some of the text. There's also the issue of concatenation (I suspect the TJ handler, arrayCallback, may need to look at the inter-string space and generate a space character if it's large enough). I'd love to see a more complete version, either the original author or a commenter who said they got it to work.
I am trying for over a week to implement this but as I am a big noob i can't manage to pull it through. Please if it's possible to link a full working source code. Thank you in advance!
I have implemented a PDF reader application for iPad. It's working fine. But I want to parse the table of contents from the PDF, and I don't know how that is possible. Please help me with this problem.
Hi, Thanks for useful code. I have to highlight the searched text in pdf. As like what the acrobat pdf reader does. Is there any possibility. Please let me know. I m in urgency. Thanks in advance
Hi, I am really thankful to u for posting such a useful piece of code.It works fine with my reader. Currently I am in r&d phase of my project and facing a problem regarding highlighting the searched text.Please provide me some idea of this,i'll be highly obliged for the same. Thanks once again.
Hello, Kindly let me know how to pass the text to be searched in this code,and how to call the 'page' method for searching? What is the full code to search and highlight the text in pdf in iphone? Plz reply ASAP. Thanks!!!
Hello, Kindly tell how to pass the text to be searched and call 'page' method for search in this code and how to highlight the same? plz reply ASAP. Thanks.
Hi Jiva DeVoe, its work for me, Thanks
Hi Jiva, I am able to search text using the above code. But I am facing a problem. Searching works perfectly in some PDFs only. If I search in a PDF such as iPhone101.pdf or other Apple PDFs, it gives no results or weird results. If I print the string content, it shows junk characters. Please point me to any reference or code if anyone has found a solution.
Excellent resource Jiva, I will certainly be tucking this one away for future use.