Regex for Dummies: Day 5
For day five of our series, we’ll examine how to use regular expressions and PHP to “scrape” content from other pages. We’ll be reviewing the preg_match_all function as well as ‘foreach’ statements.
Day 5: Preg_match_all, and Scraping Data
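For readers following along without the video, here is a minimal sketch of the idea: run preg_match_all over a page's HTML, then loop over the captured links with foreach. The function name extract_links, the sample markup, and the pattern itself are assumptions made for illustration, not necessarily what the screencast uses.

```php
<?php
// Hypothetical helper: returns every href value found in an HTML string.
function extract_links(string $html): array
{
    // Assumed pattern: an <a ...> tag with a double- or single-quoted href.
    $pattern = '/<a\s[^>]*href=["\']([^"\']+)["\']/i';

    // preg_match_all fills $matches: [0] holds the full matches,
    // [1] holds whatever the first parenthesised group captured.
    preg_match_all($pattern, $html, $matches);

    return $matches[1];
}

// Inline sample markup so the sketch runs without a network request.
// For a live page you could use: $html = file_get_contents('http://example.com/');
$html = '<ul>
  <li><a href="http://example.com/one">One</a></li>
  <li><a class="x" href=\'/two\'>Two</a></li>
</ul>';

$links = extract_links($html);
foreach ($links as $link) {
    echo $link, PHP_EOL;
}
```

Regular expressions are fine for quick jobs like this, but they can trip over unusual markup, so treat the pattern as a starting point rather than a general HTML parser.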
I got confused for a second because the introductory text is the same as day four, just so you know.
*watches*
Great screencast!!
Thank you. Could you recommend a very good book on regular expression patterns?
^ I recommend http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/1565922573
I did it like this: http://another-perfect-world.org/junk/getLinks.php.txt I wasn’t very flexible because it’s very hard to get flexible and not make the pattern 128738123 lines long. You can just loop through the results from the function I made.
Hi Jeff,
I came up with:
/<a[^>]*href=["\']([^"\']+)["\']/i
More notable, it allows for using a ‘ instead of a “, which some people do.
Cheers,
Phil
Thanks for “Regex for Dummies” videos
Also, a link may be written as
<a '
instead of
<a "
so @Phil Dufault’s expression is more convenient.
My question is:
I want to create a template engine.
I would also use the file_get_contents function
and preg_replace every { tag } with the content I want.
I know it will work, but is it the right way?
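One way to approach the template question is sketched below. Instead of calling preg_replace once per tag, it makes a single pass with preg_replace_callback and looks each tag up in an array. The function name render_template, the { tag } syntax with optional spaces, and the choice to leave unknown tags untouched are all assumptions for illustration.

```php
<?php
// Hypothetical helper: replaces { tag } placeholders with values from $vars.
function render_template(string $template, array $vars): string
{
    // Matches "{ name }" with optional spaces around the name.
    return preg_replace_callback(
        '/\{\s*(\w+)\s*\}/',
        function (array $m) use ($vars): string {
            // Leave unknown placeholders untouched so typos stay visible.
            return array_key_exists($m[1], $vars) ? (string) $vars[$m[1]] : $m[0];
        },
        $template
    );
}

// In a real engine the template text would come from file_get_contents().
$template = '<h1>{ title }</h1><p>{body}</p><p>{ missing }</p>';
echo render_template($template, ['title' => 'Hello', 'body' => 'World']), PHP_EOL;
```

If the values come from users, it is probably worth passing them through htmlspecialchars before inserting them into the page.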
Another fantastic video. I had to figure out myself the parenthesis match grouping function only a couple of days ago to enable me to complete a project.
The only reason I managed to complete said project was because of this video series which opened my eyes to Regular Expressions and saved me a lot of manual typing of data.
Keep up the great work!
Here is an awesome library that makes this process really great, by giving jQuery like support for grabbing dom elements from within PHP.
http://simplehtmldom.sourceforge.net/
Wow this is great! Maybe just what I needed to learn.
I need to grab the titles for the links with a second preg_match_all and got that working but how do I add a second foreach loop to echo the title?
Managed with array_combine and then echo out the links and titles within one foreach. Just took me two nights of trial and error testing.
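For anyone stuck on the same problem, the array_combine approach described above might look roughly like this. The sample markup and both patterns are assumptions; the second half shows an alternative that captures the link and title in one pass with PREG_SET_ORDER, assuming href always comes before title.

```php
<?php
$html = '<a href="/one" title="First">One</a>
<a href="/two" title="Second">Two</a>';

// Two separate passes: one for the links, one for the titles.
preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $html, $hrefs);
preg_match_all('/<a\s[^>]*title="([^"]+)"/i', $html, $titles);

// array_combine fails when the two arrays differ in length, so check first.
if (count($hrefs[1]) === count($titles[1])) {
    // Links become the keys, titles the values.
    $linkTitles = array_combine($hrefs[1], $titles[1]);
    foreach ($linkTitles as $href => $title) {
        echo $title, ' => ', $href, PHP_EOL;
    }
}

// Alternative: one pass with two groups. With PREG_SET_ORDER each element
// of $sets is one link: [0] full match, [1] href, [2] title.
preg_match_all(
    '/<a\s[^>]*href="([^"]+)"[^>]*title="([^"]+)"/i',
    $html,
    $sets,
    PREG_SET_ORDER
);
foreach ($sets as $set) {
    echo $set[2], ' => ', $set[1], PHP_EOL;
}
```

Because array_combine uses the links as array keys, two links with the same URL would collapse into one entry. The single-pass version avoids that, and it also cannot pair a link with the wrong title when some links have no title at all.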
Don’t want to sound too needy and all that, but I felt lost for a while there with no screencasts from you, Jeff. You do a great job at teaching us this stuff. I thank you very much.
Thank you for this tutorial… I can better understand iterating now, as well as how to append another list item to an unordered list. I have only been studying jQuery and JavaScript for six months, so I found this tutorial while looking for instructions on how to move the last list item to the beginning of an unordered list in jQuery. When I tried $(‘li:last’)appendto(‘ul:first’) (I left out the period and quotations on purpose), it only appended the list item to the first list item. Anyway, thank you again.
Eyveneena
Here’s the regex I used.
/<a[^>]*href=["\'](?!javascript)([^#"\']+)["\']/i
it works like Phil’s, but strips out javascript links (w/a negative lookahead) and strips out links to other anchors on the same page.
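Here is a small sketch of how a pattern like this behaves. The start of the pattern appears to have been lost when the comment was posted, so the "<a[^>]*" prefix below is an assumed reconstruction, and the sample markup is made up for the test.

```php
<?php
// Assumed reconstruction: the "<a[^>]*" prefix is a guess at how the
// commenter's pattern began. (?!javascript) is the negative lookahead.
$pattern = '/<a[^>]*href=["\'](?!javascript)([^#"\']+)["\']/i';

$html = '<a href="http://example.com/">home</a>
<a href="javascript:void(0)">popup</a>
<a href="#top">top</a>
<a href=\'/about\'>about</a>';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds only the captured href values that passed both filters.
foreach ($matches[1] as $url) {
    echo $url, PHP_EOL;
}
```

One side effect worth knowing: because # is excluded from the captured characters, a link such as page.html#section is skipped entirely, not just the pure anchor links.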
You are making things easy man! lovely thanks and please continue doing this!
Hi
nice exercise
this is what i used, i wanted the whole tag instead of href value so
)
Salman
This was a very interesting tutorial. Thanks Jeff!
A question for Jeff or to anyone else that is experienced with scraping: would this be a viable technique to use on a personal site if you wanted to, say, scrape your latest tweet on Twitter?
For example, I could easily see this being used in a real-world example to scrape data from Twitter’s supplied XML feeds for your account.
I suspect this may not be optimal though. I recall reading something in the comments here on ThemeForest that it’s best to use a cache of some sort so that your servers don’t have to keep ‘scraping’ Twitter every time the page loads. This is a new topic for me, so don’t quote me on that. Just trying to put the pieces together.
If scraping isn’t optimal for this use, can someone post up a link to a technique that would be a better choice?
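The caching idea mentioned above can be sketched with a plain file on disk: reuse the saved copy while it is fresh, and only fetch again once it has expired. The function name cached_fetch, the five-minute lifetime, and the stand-in fetcher are assumptions for illustration; on a real site the callback would wrap something like file_get_contents on the feed URL.

```php
<?php
// Hypothetical helper: returns the cached copy if it is younger than
// $ttlSeconds, otherwise calls $fetch and stores the fresh result.
function cached_fetch(string $cacheFile, int $ttlSeconds, callable $fetch): string
{
    clearstatcache(true, $cacheFile); // make sure the file's age is current

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }

    $fresh = $fetch();
    file_put_contents($cacheFile, $fresh, LOCK_EX);

    return $fresh;
}

// Demo with a stand-in fetcher so the sketch runs offline.
$cacheFile = sys_get_temp_dir() . '/latest_tweet_demo.cache';
if (is_file($cacheFile)) {
    unlink($cacheFile); // start from a clean state for the demo
}

$calls = 0;
$fetch = function () use (&$calls): string {
    $calls++; // counts how often the "remote" source is actually hit
    return 'feed contents';
};

$first  = cached_fetch($cacheFile, 300, $fetch); // calls $fetch, writes the cache
$second = cached_fetch($cacheFile, 300, $fetch); // served from the cache file
echo 'fetches made: ', $calls, PHP_EOL;
```

With this in place the remote service is contacted at most once per cache lifetime, no matter how many visitors load the page.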
Thanks for this great screencast! Really cool!
One question: what’s the name of the Espresso theme you use? I like it. Where can I get this theme? I didn’t find it in the Coffee House…
Hi Jeff,
Thanks a bunch for the tutorials.
I subscribed to the feed using my browser’s built in RSS reader. Does that count or do I have to use feedburner?
I am trying to use jquery to break Right Ascension / Declination data into their constituents (hours , minutes, and seconds) and (degrees, arc-minutes and arc-seconds), respectively from a string and store them in variables as numbers.
E.g. $dec = "-35:48:00" -> $dec_d = -35, $dec_m = 48, $dec_s = 00
I have gotten as far as
(-|)+\d+
to get the expression to extract the first part but now I am stuck. I cannot figure out how to use jquery to break apart the string and store the information in the variables I need.
Maybe this will inspire a topic for a future tutorial. I appreciate any help you can provide. I have searched long and hard online but have not found something that makes sense.
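One possible approach to the question above, sketched in PHP to match the rest of the series (the same pattern should also work with JavaScript's string match method). It uses three capture groups separated by colons. The function name parse_sexagesimal and the returned array keys are assumptions for illustration.

```php
<?php
// Hypothetical helper: splits "DD:MM:SS" (or "HH:MM:SS") into numeric parts.
function parse_sexagesimal(string $value): ?array
{
    // Optional sign, then three colon-separated numbers; seconds may have decimals.
    if (!preg_match('/^([+-]?)(\d+):(\d+):(\d+(?:\.\d+)?)$/', trim($value), $m)) {
        return null; // not in the expected format
    }

    return [
        'sign'    => $m[1] === '-' ? -1 : 1, // kept separately so "-00:30:00" stays negative
        'major'   => (int) $m[2],            // degrees or hours
        'minutes' => (int) $m[3],
        'seconds' => (float) $m[4],
    ];
}

$dec   = parse_sexagesimal('-35:48:00');
$dec_d = $dec['sign'] * $dec['major'];
$dec_m = $dec['minutes'];
$dec_s = $dec['seconds'];

echo $dec_d, ' ', $dec_m, ' ', $dec_s, PHP_EOL;
```

For the optional minus sign on its own, -?\d+ is a simpler way to write what (-|)+\d+ is aiming for.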
How would I select everything that is not “xyz”, rather than everything that is not x, y, or z? I want them treated as one group.
Apparently [^(xyz)] excludes each of the letters individually. I want it to match text that does not contain xyz as a group.
Is that possible?
Thanks
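A common way to do this is with a negative lookahead, sketched below. The sample strings are made up, and the pattern assumes the goal is to accept a whole string only when the sequence xyz appears nowhere in it.

```php
<?php
// [^(xyz)] is a character class: it matches ONE character that is not
// "(", "x", "y", "z" or ")". Parentheses do not group inside brackets.
// To reject the sequence "xyz", check at every position that "xyz"
// does not start there, then consume one character and repeat.
$withoutXyz = '/^(?:(?!xyz).)*$/s';

$samples = ['abc', 'xy z', 'zyx', 'abxyzcd'];
foreach ($samples as $sample) {
    $ok = preg_match($withoutXyz, $sample) === 1;
    echo $sample, ': ', $ok ? 'ok' : 'contains xyz', PHP_EOL;
}
```

If all you need is a yes-or-no test in PHP, strpos($text, 'xyz') === false does the same job without a regex.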
I use regexpr in R and in a program called Search Everything. I know there is a boolean OR; it is |. However, is there a boolean AND? For instance, how do I match a string that I know has “farrel” somewhere in it and has “august” somewhere in it?
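There is no dedicated AND operator, but two lookaheads placed side by side behave like one. The sketch below is in PHP to match the rest of the series; the sample strings are made up. R's regex functions accept perl = TRUE, which should allow the same lookahead syntax, though that is worth checking against the R documentation.

```php
<?php
// Each lookahead checks one condition from the start of the string without
// consuming anything, so the match succeeds only if BOTH words are present.
$both = '/^(?=.*farrel)(?=.*august)/is';

// Alternative for engines without lookaheads: spell out both orders.
$bothNoLookahead = '/farrel.*august|august.*farrel/is';

$samples = [
    'Farrel went on holiday in August',
    'august report by farrel',
    'farrel only',
];

foreach ($samples as $sample) {
    $hit = preg_match($both, $sample) === 1;
    echo $sample, ': ', $hit ? 'match' : 'no match', PHP_EOL;
}
```

The alternation version grows quickly as more words are added, which is why the lookahead form is usually preferred where the engine supports it.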