Regex for Dummies: Day 5

For day five of our series, we’ll examine how to use regular expressions and PHP to “scrape” content from other pages. We’ll be reviewing the preg_match_all function as well as ‘foreach’ statements.
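
For readers following along without the video, here is a minimal sketch of the technique named above: preg_match_all collects every link in a chunk of markup, and a foreach loop walks the results. It is not the screencast's actual code. The sample markup, the pattern, and the variable names are assumptions made for illustration, and a real scraper would probably fetch the markup with something like file_get_contents.

```php
<?php
// Hypothetical sample markup; in practice this string might come from
// file_get_contents('http://example.com/') instead.
$html = <<<HTML
<ul>
  <li><a href="http://example.com/one">One</a></li>
  <li><a href='http://example.com/two'>Two</a></li>
</ul>
HTML;

// Group 1 captures the opening quote character and \1 requires the same
// character to close the attribute, so single- and double-quoted hrefs
// both match. Group 2 lazily captures the URL in between.
$pattern = '/<a\s[^>]*href=(["\'])(.*?)\1/i';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches; $matches[2] holds the captured URLs.
$links = $matches[2];

foreach ($links as $link) {
    echo $link, "\n";
}
```

A pattern like this is fine for quick jobs on markup you know well. For anything more irregular, a DOM parser is usually the sturdier choice.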



Day 5: preg_match_all and Scraping Data

Be sure to click on the “Full Screen Toggle”.



20 Comments
  • Vasili says:

    I got confused for a second because the introductory text is the same as day four, just so you know. :P

    *watches*

  • shandercage says:

    Great screencast!! :) Thank you. Could you recommend a very good book on expression patterns?

  • Vasili says:

    ^ I recommend http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/1565922573

    I did it like this: http://another-perfect-world.org/junk/getLinks.php.txt I wasn’t very flexible because it’s very hard to get flexible and not make the pattern 128738123 lines long. You can just loop through the results from the function I made. :)

  • Thanks for “Regex for Dummies” videos :)

    And it may be
    <a '
    instead of
    <a "

    @Phil Dufault’s exp. is more convenient

  • Gilad says:

    my question is:
    i want to create a template engine

    i also use the file_get_contents function
    and preg_replace all { tag } with the content i want?

    i know it will work but is it the right way?

  • Paul Woodward says:

    Another fantastic video. I had to figure out myself the parenthesis match grouping function only a couple of days ago to enable me to complete a project.

    The only reason I managed to complete said project was because of this video series which opened my eyes to Regular Expressions and saved me a lot of manual typing of data.

    Keep up the great work!

  • Here is an awesome library that makes this process really great, by giving jQuery-like support for grabbing DOM elements from within PHP.

    http://simplehtmldom.sourceforge.net/

  • Paperboy says:

    Wow this is great! Maybe just what I needed to learn. :)

    I need to grab the titles for the links with a second preg_match_all and got that working but how do I add a second foreach loop to echo the title?

  • Paperboy says:

    Managed with array_combine and then echo out the links and titles within one foreach. Just took me two nights of trial and error testing. :D

  • Joe says:

    Don’t want to sound too needy and all that, but I felt lost for a while there with no screencasts from you, Jeff. You do a great job at teaching us this stuff. I thank you very much.

  • Eyveneena says:

    Thank you for this tutorial… I can better understand iterating now, as well as how to append another list item to an unordered list. I have only been studying jQuery and JavaScript for six months, so I found this tutorial while looking for instruction on how to append the last list item to the beginning of an unordered list in jQuery. When I tried to use $(‘li:last’)appendto(‘ul:first’) (I left out the period and quotations on purpose), it only appended the list item to the first list item. Anyway, thank you again.

    Eyveneena

  • Tom says:

    You are making things easy man! lovely thanks and please continue doing this!

  • Salman says:

    Hi

    nice exercise

    this is what i used, i wanted the whole tag instead of href value so

    )

    Salman

  • Shaun C. says:

    This was a very interesting tutorial. Thanks Jeff!

    A question for Jeff or to anyone else that is experienced with scraping: would this be a viable technique to use on a personal site if you wanted to, say, scrape your latest tweet on Twitter?

    For example, I could easily see this being used in a real-world example to scrape data from Twitter’s supplied XML feeds for your account.

    I suspect this may not be optimal though. I recall reading something in the comments here on Themeforest that it’s best to use a cache of some sort so that your servers don’t have to keep ‘scraping’ Twitter every time the page loads. This is a new topic for me, so don’t quote me on that. Just trying to put the pieces together.

    If scraping isn’t optimal for this use, can someone post up a link to a technique that would be a better choice?

  • plutonium says:

    Thanks for this great screencast! Really cool!

    One question: what’s the name of the Espresso theme you use? I like it. Where can I get this theme? In the Coffee House I didn’t find it… :-(

  • Kamal Prasad says:

    Hi Jeff,

    Thanks a bunch for the tutorials.

    I subscribed to the feed using my browser’s built-in RSS reader. Does that count or do I have to use feedburner?

    I am trying to use jQuery to break Right Ascension / Declination data from a string into their constituents (hours, minutes, and seconds) and (degrees, arc-minutes, and arc-seconds), respectively, and store them in variables as numbers.

    E.g. $dec = "-35:48:00" -> $dec_d = -35, $dec_m = 48, $dec_s = 00

    I have gotten as far as
    (-|)+\d+
    to get the expression to extract the first part but now I am stuck. I cannot figure out how to use jquery to break apart the string and store the information in the variables I need.

    Maybe this will inspire a topic for a future tutorial. I appreciate any help you can provide. I have searched long and hard online but have not found something that makes sense.

  • Max says:

    How would I select everything that is not “xyz”? Not x, y, or z individually; I want them treated as one group.

    Apparently [^(xyz)] excludes any one of the letters. I want to say that it should pick the things that do not contain xyz together as a group.

    Is that possible?

    Thanks

  • Farrel says:

    I use regexpr in R and in a program called search everything. I know there is a boolean OR; it is |. However, is there a boolean AND? For instance, how do I match a string that I know has “farrel” somewhere in it and has “august” somewhere in it?
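
The last two questions in the thread can both be approached with lookahead assertions, at least in PCRE-style engines such as PHP's preg functions. The sketch below is one possible approach rather than an answer from the screencast, and the sample strings and variable names are made up for illustration. A character class such as [^(xyz)] only excludes single characters, whereas a negative lookahead can reject the whole sequence "xyz". Stacking several positive lookaheads behaves like a boolean AND.

```php
<?php
// AND: both lookaheads are tested from the start of the string, so
// "farrel" and "august" must each appear somewhere, in either order.
// The i flag ignores case; the s flag lets . cross line breaks.
$both = '/^(?=.*farrel)(?=.*august)/is';

// NOT the sequence "xyz": the negative lookahead fails if "xyz" occurs
// anywhere after the start, so only strings without it match.
$noXyz = '/^(?!.*xyz)/s';

// Hypothetical sample strings.
$samples = [
    'Farrel was away in August',
    'august, then farrel',
    'farrel only',
    'abcxyzdef',
];

foreach ($samples as $s) {
    printf(
        "%-28s both=%d noXyz=%d\n",
        $s,
        preg_match($both, $s),
        preg_match($noXyz, $s)
    );
}
```

Whether a given tool accepts lookaheads depends on its regex engine. In R, for example, they generally need perl = TRUE.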