BeautifulSoup, an HTML Parsing Library

Bism 2016. 12. 15. 16:00

Reference sites:

http://dplex.egloos.com/category/Python : BeautifulSoup examples

http://lxml.de : the lxml library, which can handle large volumes of files

--- Setup before running BeautifulSoup ---

1) Install requests: C:\>pip install requests

2) Install BeautifulSoup: C:\>easy_install beautifulsoup4 <--- for Python 3 (pip install beautifulsoup4 works as well)

3) From http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, download the lxml file that matches your installed Python version, extract it, and paste the resulting folder into ~\Lib\site-packages.
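
Once the three steps are done, it can help to confirm that Python actually sees the packages before running the examples. The following is a minimal sketch using only the standard library; the helper name `check_packages` is made up for illustration, and `requests`, `bs4`, and `lxml` are the import names of the packages installed above.

```python
from importlib.util import find_spec


def check_packages(names=("requests", "bs4", "lxml")):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one line per package so a missing install is easy to spot.
    for name, found in check_packages().items():
        print(f"{name}: {'OK' if found else 'missing'}")
```

If `lxml` shows up as missing, BeautifulSoup can still be used with Python's built-in parser by passing 'html.parser' instead of 'lxml'.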

Example 1

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def spider():
    base_url = "http://www.naver.com/index.html"
    # Fetch the page; the response object also carries the headers
    source_code = requests.get(base_url)
    # Keep only the response body as text
    plain_text = source_code.text
    print(plain_text)
    # Convert plain_text to a BeautifulSoup object so the library can search through it
    convert_data = BeautifulSoup(plain_text, 'lxml')
    # Pick out the useful links
    for link in convert_data.find_all('a', {'class': 'h_notice'}):
        # Build a clickable absolute URL; urljoin resolves relative hrefs against
        # base_url instead of appending them to ".../index.html"
        href = urljoin(base_url, link.get('href'))
        title = link.string  # just the text, not the HTML
        print(href)   # display href
        print(title)  # display title

spider()
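
The link-extraction step in the example above depends on the live page still having `a` tags with the class `h_notice`, which may no longer be the case. To try the same idea without a network connection, here is a small sketch that runs on an inline HTML string. `SAMPLE_HTML` and the helper `extract_links` are made up for illustration, and 'html.parser' is used so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div>
  <a class="h_notice" href="/notice/1">First notice</a>
  <a class="other" href="/ignored">Ignored</a>
  <a class="h_notice" href="http://example.com/notice/2">Second notice</a>
</div>
"""


def extract_links(html, base_url, css_class):
    """Return (absolute_url, text) pairs for <a> tags with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", class_=css_class):
        href = link.get("href")
        if href is None:
            continue  # skip anchors that have no href attribute
        # urljoin leaves absolute URLs alone and resolves relative ones.
        results.append((urljoin(base_url, href), link.get_text(strip=True)))
    return results


if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML, "http://www.naver.com/index.html", "h_notice"):
        print(url, text)
```

Using `urljoin` rather than string concatenation means a relative href such as /notice/1 becomes http://www.naver.com/notice/1, while an href that is already absolute is kept unchanged.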

Example 2 - Printing Naver's trending search terms with BeautifulSoup

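from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("http://naver.com")
soup = BeautifulSoup(html.read(), "lxml")

for row in soup.find_all("ol"):
    # print(row)
    for i in range(11):
        try:
            li = row.find_all("li", attrs={"value": str(i)})
            # Crude string slicing: take the tag that follows <li ...>,
            # then keep whatever comes after title=
            st1 = str(li).split(">")[1]
            st2 = st1.split("title=")[1]  # a trending search term, e.g. Interpark
            print("%d => %s" % (i, st2))
        except IndexError:
            # No matching <li> or no title attribute (e.g. i = 0): print the header line
            print("--- Naver trending search terms ---")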
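
The string splitting in the example above is fragile, because it breaks as soon as the attribute order or the nesting changes. As a rough alternative, the same values can be read straight from the parsed tags. This is only a sketch on made-up markup: `SAMPLE_HTML` and `ranked_titles` are hypothetical, and the real Naver page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a ranked keyword list.
SAMPLE_HTML = """
<ol>
  <li value="1"><a href="#" title="first keyword">first keyword</a></li>
  <li value="2"><a href="#" title="second keyword">second keyword</a></li>
  <li value="3"><a href="#">no title here</a></li>
</ol>
"""


def ranked_titles(html):
    """Return (rank, title) pairs for <li value=...> items whose <a> has a title."""
    soup = BeautifulSoup(html, "html.parser")
    ranked = []
    # attrs={"value": True} matches any <li> that has a value attribute.
    for li in soup.find_all("li", attrs={"value": True}):
        anchor = li.find("a", attrs={"title": True})
        if anchor is not None:
            ranked.append((int(li["value"]), anchor["title"]))
    return ranked


if __name__ == "__main__":
    for rank, title in ranked_titles(SAMPLE_HTML):
        print("%d => %s" % (rank, title))
```

Reading `li["value"]` and `anchor["title"]` directly avoids the quote characters that the split-based version leaves in its output, and items without a title are skipped instead of raising an exception.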

Further documentation:

https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
