I have a question about Python web crawling.


While studying crawling, there are two things I currently want to try.

The first one:

```python
from bs4 import BeautifulSoup as BS
import requests as req

url = "https://auction1.land.naver.com/auction/ca_list.php"
res = req.get(url, verify=False)
soup = BS(res.text, "html.parser")

arr = soup.select("table.tbl_result td>a:first-child")
for a in arr:
    print(a.get_text(strip=True))
```

I got this far and then got stuck, and I don't know how to get past it.

All I want to do is pick out the case number and the appraised prices. Looking at the developer tools: the case number is in a td with class "num", the location seems to be under an a tag inside a td with class "area", and the prices are in td elements with class "price" plus "num_type1" and "num_type2".

```python
tr = soup.select("table.tbl_result tr")
print(tr)
```

```python
for tr in soup.select("table.tbl_result tr"):
    if len(tr.select("td.num")) == 0:
        continue
    title = tr.select("td.num")[0].get_text(strip=True)
    area = tr.select("td.area")[0].get_text(strip=True)
    price1 = tr.select("td.price:nth-child(2)")[0].get_text(strip=True)
    price2 = tr.select("td.price:nth-child(3)")[0].get_text(strip=True)
    print(title, "/", area, price1, price2)
```
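As an aside, `td.price:nth-child(2)` means "a `td.price` that is also the second child of its row", counting `td.num` and `td.area` as siblings, so it can easily match nothing. Selecting all `td.price` cells and indexing them is more robust. A sketch on invented markup (the HTML below is my own guess at the structure, not the real page):

```python
from bs4 import BeautifulSoup as BS

# Hypothetical row, invented for illustration only
html = """
<table class="tbl_result">
  <tr>
    <td class="num">2022-1234</td>
    <td class="area"><a>Seoul</a></td>
    <td class="price num_type1">100,000,000</td>
    <td class="price num_type2">80,000,000</td>
  </tr>
</table>
"""
soup = BS(html, "html.parser")

for tr in soup.select("table.tbl_result tr"):
    nums = tr.select("td.num")
    if not nums:
        continue
    title = nums[0].get_text(strip=True)
    area = tr.select("td.area")[0].get_text(strip=True)
    prices = tr.select("td.price")        # grab every price cell, then index
    price1 = prices[0].get_text(strip=True)
    price2 = prices[1].get_text(strip=True)
    print(title, "/", area, price1, price2)
```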

No information came out.

After that I tried going through it one element at a time, but the information didn't come out this time either, so I feel I'm lacking a lot of knowledge... I don't know how to approach it.

```python
from bs4 import BeautifulSoup as BS
import requests as req

url = "https://auction1.land.naver.com/auction/ca_list.php"
res = req.get(url, verify=False)
soup = BS(res.text, "html.parser")
```
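One thing worth ruling out first (a sketch; `looks_server_rendered` and the `"tbl_result"` marker are my own names for illustration): some sites block requests that don't send a browser-like User-Agent, and others build the table with JavaScript after the page loads, in which case the data never appears in `res.text` at all and no selector will find it.

```python
def looks_server_rendered(html, marker):
    # If a class name you saw in devtools (the "marker") is missing from the
    # raw HTML, the content is probably injected by JavaScript after page load,
    # and plain requests + BeautifulSoup will never see it.
    return marker in html

# Network usage (hypothetical, commented out so the sketch stays self-contained):
# import requests as req
# headers = {"User-Agent": "Mozilla/5.0"}  # the default python-requests UA is often blocked
# res = req.get("https://auction1.land.naver.com/auction/ca_list.php",
#               headers=headers, verify=False)
# print(res.status_code)                                # 200 = request itself succeeded
# print(looks_server_rendered(res.text, "tbl_result"))  # False = table built client-side
```

If the marker is missing from the raw HTML, requests alone won't work, and something like Selenium (or the site's underlying JSON API, visible in the devtools Network tab) is needed.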

The second one: with another crawl, I wanted to get the number of confirmed cases per day by region.

```python
from bs4 import BeautifulSoup as BS
import requests as req

url = "https://news.daum.net/covid19"
res = req.get(url, verify=False)
soup = BS(res.text, "html.parser")

tds = soup.find_all("a.link_location")
tdds = soup.find_all("a.num_location")
print(tds)
print(tdds)
```

Looking at the developer tools, the region names and case counts are written in txt_location and num_location under the link_location elements, so I tried to access them that way, but I couldn't get any information.

```python
from bs4 import BeautifulSoup as BS
import requests as req

url = "https://news.daum.net/covid19"
res = req.get(url, verify=False)
soup = BS(res.text, "html.parser")

states = soup.select("span.txt_location")
print(states)
region = []
for state in states:
    if "txt_location" in state.attrs:
        region.append(state.get_text(strip=True))
print(region)
```
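For what it's worth, the empty `region` has a specific cause: `state.attrs` maps attribute *names* (like `"class"`) to their values, so `"txt_location" in state.attrs` is always False. Checking the class list instead works, as this sketch on invented markup shows (the HTML is my own illustration); note that `select("span.txt_location")` already filters by that class, so the inner check is redundant anyway:

```python
from bs4 import BeautifulSoup as BS

# Hypothetical snippet of the markup described above, invented for illustration
html = """
<a class="link_location">
  <span class="txt_location">Seoul</span>
  <span class="num_location">1,234</span>
</a>
<a class="link_location">
  <span class="txt_location">Busan</span>
  <span class="num_location">567</span>
</a>
"""
soup = BS(html, "html.parser")

# attrs looks like {"class": ["txt_location"]} -- check the class LIST,
# not the attrs dict keys.
region = []
for state in soup.select("span.txt_location"):
    if "txt_location" in state.get("class", []):
        region.append(state.get_text(strip=True))
print(region)
```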

If I do this, `states` does seem to contain the region information, and I googled how to extract just the regions from it, but nothing ends up in `region`.

When I tried approaching it in a different way, as below, nothing printed at all, so I think I took the wrong approach.

```python
from bs4 import BeautifulSoup as BS
import requests as req

url = "https://news.daum.net/covid19"
res = req.get(url, verify=False)
soup = BS(res.text, "html.parser")

tds = soup.find_all("a>span:first-child")
tdds = soup.find_all("a>span:nth-child(2)")
print(tds)
print(tdds)
```
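The likely reason this last attempt prints nothing: `find_all()` expects a tag *name*, not a CSS selector, so `find_all("a>span:first-child")` searches for a tag literally named that and finds none. `select()` is BeautifulSoup's CSS-selector API. A sketch on the same kind of invented markup:

```python
from bs4 import BeautifulSoup as BS

# Invented markup for illustration only
html = """
<a class="link_location">
  <span class="txt_location">Seoul</span>
  <span class="num_location">1,234</span>
</a>
"""
soup = BS(html, "html.parser")

# find_all("a>span:first-child") would return [] -- use select() for selectors
names = [s.get_text(strip=True) for s in soup.select("a > span:first-child")]
counts = [s.get_text(strip=True) for s in soup.select("a > span:nth-child(2)")]
print(names, counts)
```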

Please give me some advice.


2022-09-19 23:34

1 Answer

Don't rely only on BeautifulSoup; use find_element with an XPath (i.e., Selenium).

The larger the total amount of data and the clearer its regularity, the more efficient find_element with XPath becomes; the more complex the page and the less there is to extract, the more convenient HTML parsing is.

To give you a brief description of the order:

For more information, search for the two keywords mentioned above.
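As a minimal sketch of that approach (assumes `pip install selenium` and a compatible Chrome driver on PATH; the XPaths are illustrative and would need to match the real markup seen in devtools):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://news.daum.net/covid19")

# Selenium drives a real browser, so JavaScript-rendered content is present.
# These XPaths are guesses based on the class names described in the question.
regions = driver.find_elements(By.XPATH, '//span[@class="txt_location"]')
counts = driver.find_elements(By.XPATH, '//span[@class="num_location"]')
for r, c in zip(regions, counts):
    print(r.text, c.text)

driver.quit()
```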


2022-09-19 23:34


