Python - getting the text from the most popular news stories


I am trying to scan cnn.com's most popular news stories, extract the top ten article links, and save each article's text so I can count the most used words in it. My code is not getting the top links from the web page. Any help is appreciated. How do I make it look at only the first ten links found on cnn.com/mostpopular?

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen('http://www.cnn.com/mostpopular/').read()
    soup = BeautifulSoup(html)
    for item in soup.find_all(attrs={'class': 'cnnwcboxcontent'}):
        for link in item.find_all('a'):
            for item in link.get('href'):
                #soups = BeautifulSoup(item)
                #soups.find_all(
                print item

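To get the links you are interested in, you need to access the div with id "cnnmostpopulartabs1" and, inside it, the divs with class "cnnmpcontentheadline":

    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://edition.cnn.com/mostpopular/")

    # Narrow the search to the first "most popular" tab, then take each headline div.
    data = BeautifulSoup(r.content, "html.parser").find("div", {"id": "cnnmostpopulartabs1"}).find_all("div", {"class": "cnnmpcontentheadline"})

    from pprint import pprint as pp
    # Each headline div wraps a single <a>; d.a is the first anchor inside it.
    pp([d.a["href"] for d in data])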

Output:

    ['http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html',
     'http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html',
     'http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html',
     'http://edition.cnn.com/2014/12/16/travel/best-destinations-2015/index.html',
     'http://edition.cnn.com/2014/12/26/opinion/soussan-weingarten-gender-equality/index.html',
     'http://edition.cnn.com/2014/12/09/opinion/yang-mark-wahlberg/index.html',
     'http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html',
     'http://edition.cnn.com/2014/12/29/politics/obama-golf-hawaii/index.html',
     'http://edition.cnn.com/2014/12/10/sport/football/twitter-trends-sport-world-cup-mario-balotelli-list/index.html',
     'http://edition.cnn.com/2014/12/19/travel/new-2015-hotels/index.html']
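The selection logic can also be tried offline, without hitting the network. The sketch below swaps BeautifulSoup for the standard library's html.parser and runs on a made-up HTML snippet. The sample markup, the example.com URLs and the HeadlineLinkParser class are assumptions for illustration, not CNN's real page. Unlike the answer's code, it does not restrict the search to the "cnnmostpopulartabs1" container.

```python
from html.parser import HTMLParser


class HeadlineLinkParser(HTMLParser):
    """Collect the first <a href> found inside each div with a given class."""

    def __init__(self, headline_class):
        super().__init__()
        self.headline_class = headline_class
        self.links = []
        self._depth = 0       # div nesting depth inside a headline div (0 = outside)
        self._taken = False   # True once the current headline div has given a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth:
                # Nested div inside a headline div: track it so we know when we leave.
                self._depth += 1
            elif self.headline_class in (attrs.get("class") or "").split():
                self._depth = 1
                self._taken = False
        elif tag == "a" and self._depth and not self._taken and attrs.get("href"):
            self.links.append(attrs["href"])
            self._taken = True

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


# Hypothetical markup that mimics the structure described in the answer.
SAMPLE_HTML = """
<div id="cnnmostpopulartabs1">
  <div class="cnnmpcontentheadline"><a href="http://example.com/story-1">One</a></div>
  <div class="cnnmpcontentheadline"><a href="http://example.com/story-2">Two</a></div>
</div>
<div class="other"><a href="http://example.com/ignored">Ignored</a></div>
"""

parser = HeadlineLinkParser("cnnmpcontentheadline")
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Only anchors inside a headline div are collected, so the link in the unrelated div is skipped.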

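You could also slice the result of find_all("div",{"class":"cnnmpcontentheadline"}):

    data = BeautifulSoup(r.content, "html.parser").find_all("div", {"class": "cnnmpcontentheadline"})
    from pprint import pprint as pp
    # Keep only the first ten headline divs on the page.
    pp([d.a["href"] for d in data[:10]])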

Output:

    ['http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html',
     'http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html',
     'http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html',
     'http://edition.cnn.com/2014/12/16/travel/best-destinations-2015/index.html',
     'http://edition.cnn.com/2014/12/26/opinion/soussan-weingarten-gender-equality/index.html',
     'http://edition.cnn.com/2014/12/09/opinion/yang-mark-wahlberg/index.html',
     'http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html',
     'http://edition.cnn.com/2014/12/29/politics/obama-golf-hawaii/index.html',
     'http://edition.cnn.com/2014/12/10/sport/football/twitter-trends-sport-world-cup-mario-balotelli-list/index.html',
     'http://edition.cnn.com/2014/12/19/travel/new-2015-hotels/index.html']

I would recommend not slicing, though, as there is a possibility that the page has more or fewer links.
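If a hard cap is still wanted, one option is a small helper that works with any number of links and drops repeats. This is only a sketch: first_unique is a hypothetical helper name, and the example.com hrefs are made up to stand in for the list built from data.

```python
def first_unique(links, limit=10):
    """Return at most `limit` links, skipping repeats and keeping page order."""
    seen = set()
    result = []
    for link in links:
        if len(result) >= limit:
            break
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Made-up hrefs standing in for [d.a["href"] for d in data]
hrefs = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # repeated link
]
top_links = first_unique(hrefs, limit=10)
print(top_links)
```

A page with fewer than ten links simply yields a shorter list, so nothing downstream has to assume an exact count.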

To get the paragraph text, you can find the cnn_strylftcntnt div and then use find_all_next to get the p tags:

    for link in (d.a["href"] for d in data):
        r = requests.get(link)
        div = BeautifulSoup(r.content, "html.parser").find("div", {"class": "cnn_strylftcntnt"})
        if div:
            print("text {}".format(link))
            # Join the text of every <p> that follows the story container.
            print("".join([p.text for p in div.find_all_next("p")]))
        else:
            print("no text link {}".format(link))
        print()

Output:

text http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html (cnn) -- gone days of grainy camera phone images resolution of poor imitation monet. today's smartphone cameras advanced mobile photography becoming art form in own right, turning photo-sharing apps instagram portable galleries amateur photographers, , professionals street style photographer tommy ton , chief official white house photographer pete souza."you have dark room in pocket," says pierre le govic, paris-based founder of out of phone, world's first publishing house dedicated mobile photography.this month, out of phone follows debut publication, last year's book of mobile photos two-time pulitzer prize-nominated photographer richard koci hernandez, out of phone: mobile photo book 2014, diverse selection of 100 instagram images taken users 25 countries.read: decaying splendor of abandoned italian nightclubsdemocratizing photography before founding out of phone in 2013, le govic ran fine art photography printing company counted daido moriyama , william eggleston clients. first started following mobile photography on instagram in 2011, , surprised , impressed quality of work hobbyists creating."now there many known photographers use platform, @ beginning, there many people didn't know photography, , these kind of people wanted showcase," says. "but on other hand, confusing because there many images."the desire curate seeing, coupled longtime ambition create books, led him give publishing try.while le govic had preselected number of established photographers feature in year's inaugural anthology (he's hoping become annual publication), gave instagram users chance put consideration, using hashtag #outofthephone nominate best works. astounded receive on 20,000 submissions.what looking in successful entry? technical skill understandably important, le govic says sought less tangible."at end, important story , sensibility of photographer ... 
it's mix between story, composition," says. "photography, me, sort of fresh air, way @ things differently. i'm looking sort of feeling when @ pictures."preserving "moments of grace"now mobile photo book has been published, le govic looking forward promoting concept , expanding. he's looking start hiring in new year (so far, it's been one-man operation), , solicit investors , partners. several projects set release next year, including books award-winning documentary photographer benjamin lowy, , other photographers believes using medium fullest.read: behind scenes @ legendary studio 54"some images deserve paper because it's kind of memory," says. "if can keep memory of interesting moments, moments of grace perhaps...i think it's interesting fix them on paper , alert people not forget them."out of phone: mobile photo book 2014 available purchase online.unseen pictures of rolling stones , pink floydsupercar shangri-la: full throttle through italy's 'motor valley'this aerial photographer captures eerie geometry of lifea peek inside europe's prestigious photography festival  text http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html (cnn) -- utah couple journey through in-vitro fertilization captivated nation welcomed quadruplets -- 2 sets of identical twins -- sunday.ashley , tyson gardner said "overwhelmed joy" after birth of indie, esme, scarlett , evangeline caesarean section @ utah valley regional medical center in provo. 3 of newborns weighed little more 2 pounds @ delivery. fourth weighed less 2 pounds, according hospital.the gardners announced news on facebook page share news pregnancy."mom , babies doing incredible!!! happy how turned out today! doctors, nurses, , staff incredible!! more updates follow soon!!"the pleasant grove couple conceived 2 sets of identical twins summer of in-vitro fertilization. 
in october, ashley gardner had emergency laser surgery in california save 1 set suffering twin-to-twin transfusion syndrome, hospital said in news release. began staying in antepartum suite @ utah valley regional in november after doctors decided hospital bed rest necessary.the 4 girls, dubbed "quad squad" hospital, due march 11. doctors decided deliver them 12 weeks after discovering ashley gardner had ruptured membranes , contractions continued progress in intensity, hospital said.complications leading premature delivery common in multiple gestations, whether achieved naturally or though ivf, said dr. andrew toledo, ceo of reproductive biology associates in atlanta, largest ivf program in southeast. data show women achieve pregnancy through ivf have higher rate of complications compared patients conceive naturally.it's extremely rare both embryos split, it's more common in ivf pregnancies compared patients conceive naturally, said.in youtube video posted sunday morning hospital, tyson gardner said mom , babies doing after night in hospital , expected quads come in next couple of days."we need lots of prayers next 48 hours," ashley gardner said hospital bed.the gardners tried years pregnant. finally, learned in july first in-vitro fertilization attempt successful. real surprise came during ultrasound, when learned pregnant quadruplets.a friend in room captured priceless on face in picture took internet storm. in 1 week, gardners' facebook page grew 16,000 likes 24,300. today, has 300,000 facebook fans, , tv network tlc following them series set air in 2015.well-wishers flooded facebook page monday congratulations , requests pictures."congratulations," 1 person said. "wishing health , happiness many years come."  text http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html pyongyang (cnn) -- exceedingly rare western journalists allowed inside democratic peoples republic of korea (dprk) -- commonly known north korea. 
less common american reporter visit reclusive nation, home 25 million people isolated rest of world.yet here am, american member of cnn crew, reporting pyongyang latest high profile sporting event sweep city since bizarre basketball tournament earlier year.you remember when american nba star dennis rodman organized basketball tournament in pyongyang.rodman criticized in united states befriending dprk's supreme leader kim jong un, authoritarian regime has been accused united nations panel of widespread human rights abuses, charges north korea denies. 'sports diplomacy'outside press not invited cover rodman's trip. time, cnn among handful of news organizations granted rare access pyongyang cover international pro wrestling festival.retired japanese wrestling star turned politician kanji "antonio" inoki organizing event. in professional heyday, inoki fought in memorable , bizarre 1976 match in tokyo boxing great muhammad ali. today, aging member of japanese parliament, once again in headlines latest attempt @ calls "sports diplomacy" between japan , north korea.inoki holding event in home country of rikidozan, late wrestling mentor. says bring professional fighters united states, china, , several other countries. wrestlers scheduled tour pyongyang , interact north korean fans.our journey farafter landing in pyongyang, headed our hotel,which sits on own island.complete microbrewery, hotel tries give journalists on trip western experience, serving simple western-style omelettes , potatoes breakfast. dinner korean-style meal.taking around city, saw people holding cell phones, looked small blackberrys. people weren't blindly walking eyes locked on screen; common sight in western cities.these not touch-screen phones, instead gadgets people can access internal net , visit north korean sites government sites , country's largest newspaper.on friday morning, visited birthplace of north korean founder, kim il sung. 
site considered sacred -- every north korean visits capital goes there. bus loads of school children, took 23-hour trip northern rural province, arrived @ site take look.asked how felt being there, students recited facts place. when our minders encouraged them speak us, appeared shy or nervous facing foreigners , tv cameras.we headed munsu water park, park water slides , pools, current leader, kim jong un, said have scrutinized 113 times. there weren't many children there, though many north korean families appeared enjoying activities.the rest of friday spent visiting new pediatric hospital , sports village -- in pyongyang.during our tightly-controlled five-day trip, under constant supervision of government minders. staying in hotel on island -- in middle of river -- , aren't allowed leave without our government-assigned escorts. expect them monitor shoot , step-in stop if point our cameras in wrong direction.we expect see government allow see -- landmarks of pyongyang, omnipresent tributes kim family regime, , majestic displays of patriotic pageantry.thawing relationsthis unusual visit hermit kingdom comes @ time when years of frosty relations between tokyo , pyongyang beginning thaw.in july, japanese prime minister shinzo abe eased several unilateral sanctions on north korea after 2 countries made progress in talks japanese citizens kidnapped north korean regime during cold war.the japanese government says north korean operatives kidnapped @ least 17 japanese citizens in late 1970s , 1980s , possibly dozens more.in 2002, north korea shocked international community admitting kidnappings , returning 5 victims japan. questions still linger fate of remaining 12 confirmed abductees , other suspected cases.a north korean "special investigative committee" of 30 government officials expected update japanese government in next few weeks on status of missing japanese citizens. 
families of abducted hope renewed diplomacy between 2 countries bring long-awaited answers.among japanese sanctions lifted restriction asking citizens not travel north korea, opens door more japanese tourists embark on commercial tours of country.behind curtainour flight on north korea's airline (one of 10 scheduled flights week) packed japanese press , eclectic group of wrestlers tour pyonyang , entertain crowds see in country.at press conference, 1 north korean official said hopes event bring dprk closer japan after years of tension.even though decades of isolation , crippling sanctions have left north korea struggling economically , lagging far behind of developed world in terms of technology , infrastructure -- nation unrivaled in ability mobilize tens of thousands of citizens put on spectacular show.it remains yet seen if glimpse behind curtain witness true reality of life in 1 of secretive places on earth.i asked our government minders if they'd willing show life regular people in north korea. said they'd ask superiors , us.read: dennis rodman returns after visit north korearead: abductee's parents meet north korean granddaughter  ........... 

I only added a couple of the outputs, as there is a limit of 30000 characters.

You will get no text for the following link, as there are no cnnmpcontentheadline or cnn_strylftcntnt elements:

no text link http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html 

If you want a word count, use a collections.Counter dict, lowering the text and stripping punctuation from the words:

    from collections import Counter, OrderedDict
    from itertools import chain
    from string import punctuation

    all_links_counters = OrderedDict()

    # [0:1] limits the run to the first link; drop the slice to process them all.
    for link in [d.a["href"] for d in data][0:1]:
        r = requests.get(link)
        div = BeautifulSoup(r.content, "html.parser").find("div", {"class": "cnn_strylftcntnt"})
        if div:
            print("text {}".format(link))
            words = chain.from_iterable(p.text.lower().split() for p in div.find_all_next("p"))
            all_links_counters[link] = Counter(word.strip(punctuation) for word in words)
        else:
            print("no text link {}".format(link))
        print()

    print(all_links_counters)

An example of the output for the first link:

[counter({'the': 33, 'of': 23, 'to': 20, 'a': 15, 'and': 13, 'he': 10, 'photography': 8, 'for': 7, 'mobile': 7, 'in': 7, 'was': 6, 'that': 6, 'are': 6, 'photographer': 6, 'phone': 6, 'is': 6, 'at': 5, 'govic': 5, 'le': 5, 'out': 5, 'says': 5, "it's": 4, 'so': 4, 'instagram': 4, 'images': 4, 'photographers': 4, 'looking': 4, 'book': 4, 'with': 3, 'on': 3, 'what': 3, 'people': 3, 'moments': 3, 'its': 3, 'but': 3, 'i': 3, 'there': 3, 'photo': 3, 'from': 3, 'also': 3, 'many': 3, 'were': 3, 'this': 3, '': 2, 'now': 2, "he's": 2, 'interesting': 2, 'some': 2, 'publishing': 2, 'like': 2, 'other': 2, 'an': 2, 'house': 2, 'been': 2, 'important': 2, 'first': 2, '2014': 2, 'by': 2, 'because': 2, 'pictures': 2, 'read': 2, 'grace': 2, "year's": 2, 'year': 2, 'memory': 2, 'books': 2, 'publication': 2, 'good': 2, 'it': 2, 'sort': 2, 'something': 2, 'look': 2, 'story': 2, 'who': 2, 'art': 2, 'paper': 2, 'using': 2, 'kind': 2, 'them': 2, 'users': 2, 'studio': 1, 'company': 1, 'souza': 1, 'founding': 1, 'hashtag': 1, 'longtime': 1, 'give': 1, 'countries': 1, 'resolution': 1, 'less': 1, 'alert': 1, 'professionals': 1, 'air': 1, 'investors': 1, '54': 1, 'eggleston': 1, 'fullest': 1, 'month': 1, 'galleries': 1, 'very': 1, 'apps': 1, 'things': 1, 'following': 1, '2011': 1, 'documentary': 1, 'rolling': 1, 'creating': 1, 'create': 1, 'differently': 1, 'stones': 1, 'successful': 1, 'much': 1, 'composition': 1, 'eerie': 1, 'next': 1, 'feature': 1, 'best': 1, 'floyd': 1, 'far': 1, 'medium': 1, 'one-man': 1, 'pete': 1, 'prestigious': 1, 'street': 1, 'set': 1, 'published': 1, 'legendary': 1, 'when': 1, 'partners': 1, 'two-time': 1, 'your': 1, 'has': 1, 'follows': 1, 'ran': 1, 'valley': 1, 'hoping': 1, 'dark': 1, 'not': 1, 'understandably': 1, 'aerial': 1, 'right': 1, 'shangri-la': 1, 'submissions': 1, 'up': 1, "europe's": 1, 'pocket': 1, 'started': 1, 'smartphone': 1, 'decaying': 1, 'inside': 1, 'camera': 1, 'confusing': 1, 'nightclubs': 1, 'you': 1, 'sought': 1, 'cameras': 1, 'think': 1, 
'2013': 1, 'own': 1, 'democratizing': 1, 'counted': 1, 'splendor': 1, 'award-winning': 1, 'hiring': 1, 'portable': 1, 'projects': 1, 'festival': 1, 'themselves': 1, 'richard': 1, 'most': 1, 'turning': 1, 'quality': 1, 'astounded': 1, "italy's": 1, 'diverse': 1, 'life': 1, 'entry': 1, 'believes': 1, 'have': 1, 'works': 1, 'geometry': 1, 'gone': 1, 'fine': 1, 'can': 1, 'mix': 1, 'photo-sharing': 1, "didn't": 1, 'while': 1, 'selection': 1, 'fix': 1, 'new': 1, 'put': 1, 'ambition': 1, "i'm": 1, 'beginning': 1, 'know': 1, 'hernandez': 1, 'preserving': 1, 'skill': 1, 'gave': 1, 'keep': 1, 'peek': 1, 'paris-based': 1, 'start': 1, 'pierre': 1, 'me': 1, 'into': 1, 'motor': 1, 'imitation': 1, 'online': 1, 'style': 1, 'ton': 1, 'days': 1, 'if': 1, 'including': 1, 'annual': 1, 'purchase': 1, 'concept': 1, 'photos': 1, 'led': 1, 'advanced': 1, 'hand': 1, 'between': 1, 'chance': 1, 'him': 1, 'will': 1, 'had': 1, 'white': 1, 'lowy': 1, 'too': 1, 'before': 1, 'end': 1, 'chief': 1, 'pink': 1, 'koci': 1, 'several': 1, 'available': 1, 'become': 1, 'amateur': 1, 'through': 1, 'wanted': 1, 'technical': 1, 'curate': 1, 'italian': 1, 'about': 1, 'unseen': 1, 'well': 1, 'becoming': 1, 'impressed': 1, 'sensibility': 1, 'full': 1, 'outofthephone': 1, 'moriyama': 1, 'receive': 1, 'their': 1, 'help': 1, 'benjamin': 1, 'grainy': 1, 'forward': 1, 'deserve': 1, 'monet': 1, 'abandoned': 1, 'william': 1, 'forget': 1, 'get': 1, 'use': 1, 'way': 1, 'prize-nominated': 1, 'promoting': 1, 'throttle': 1, 'expanding': 1, 'hobbyists': 1, 'try': 1, 'operation': 1, 'coupled': 1, 'showcase': 1, 'scenes': 1, "today's": 1, 'taken': 1, 'these': 1, 'tommy': 1, "world's": 1, 'anthology': 1, 'official': 1, 'debut': 1, 'behind': 1, 'work': 1, 'pulitzer': 1, '25': 1, '100': 1, 'number': 1, 'perhaps...i': 1, 'known': 1, 'fresh': 1, 'founder': 1, 'cnn': 1, 'seeing': 1, 'feeling': 1, 'desire': 1, 'established': 1, 'poor': 1, '20,000': 1, 'supercar': 1, 'preselected': 1, 'nominate': 1, 'printing': 1, 'daido': 1, 'over': 
1, 'form': 1, 'captures': 1, 'last': 1, 'solicit': 1, 'his': 1, 'release': 1, 'room': 1, 'as': 1, 'surprised': 1, 'platform': 1, 'tangible': 1, 'clients': 1, 'consideration': 1, 'inaugural': 1, 'dedicated': 1})] 
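The output above contains an entry for the empty string (`'': 2`), which appears when a token such as "--" is made up entirely of punctuation. A minimal sketch of the same counting approach that filters those out, using only the standard library, could look like this. count_words is a hypothetical helper name, and the sample paragraphs are made up.

```python
from collections import Counter
from string import punctuation


def count_words(paragraphs):
    """Count lowercased words across paragraphs, ignoring punctuation-only tokens."""
    words = (
        token.strip(punctuation)
        for paragraph in paragraphs
        for token in paragraph.lower().split()
    )
    # Tokens like "--" strip down to "", so drop anything empty.
    return Counter(word for word in words if word)


# Made-up paragraphs standing in for the p.text values of one article.
sample = ["(CNN) -- Gone are the days.", "The days of grainy images are gone!"]
counts = count_words(sample)
print(counts.most_common(3))
```

Counter.most_common(n) returns the n most frequent words with their counts, which is usually more readable than printing the whole Counter.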
