Monday, January 13, 2020

Python - some tricks with web scraping (decompose(), zip(), modify HTML tags, etc.)



  • If there is a junk HTML tag nested inside the tag you want to scrape, e.g.


<table align="center" border="0" cellpadding="0" cellspacing="0" height="0%" summary="Scout Ticket well data content table" width="98%">
......data you want to scrape.......
<table border="0" cellpadding="0" cellspacing="0" height="0%" summary="Plan View Table" width="100%">....junk table....</table>
......more data you want to scrape.......
</table>

then you can call decompose() on each junk tag, which removes the tag and everything inside it from the parse tree:

for table_useless in soup.find_all("table", {"summary": "Plan View Table"}):
    table_useless.decompose()
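
To see the effect end to end, here is a minimal sketch that runs the same loop on a cut-down version of the HTML above. The sample markup, the `html.parser` choice, and the variable names `html_doc`, `outer` and `text` are assumptions made for illustration, not part of the original page.

```python
from bs4 import BeautifulSoup

# Simplified, made-up sample that mirrors the structure shown above:
# an outer table we want, with a junk table nested inside it.
html_doc = """
<table summary="Scout Ticket well data content table">
  <tr><td>data you want to scrape</td></tr>
  <tr><td>
    <table summary="Plan View Table"><tr><td>junk table</td></tr></table>
  </td></tr>
  <tr><td>more data you want to scrape</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove every junk table (and all of its children) from the tree.
for table_useless in soup.find_all("table", {"summary": "Plan View Table"}):
    table_useless.decompose()

# Whatever is left in the outer table is the data we care about.
outer = soup.find("table", {"summary": "Scout Ticket well data content table"})
text = outer.get_text(" ", strip=True)
print(text)
```

Because decompose() destroys the tag, do this only after you have read anything you still need from the junk table.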
  • If a tag holds a label followed by a nested tag with the value, you can extract the labels and the values separately and zip them together, e.g.

<td>NDIC File No: <b>12584</b></td>, <td>     API No: <b>33-007-01163-00-00</b></td>

then you can use zip():

# data_points is assumed to be the list of <td> tags, e.g. soup.find_all("td")
header_data = [str(header.next_element).strip() for header in data_points]
detail_data = [item.find('b').get_text(strip=True) if item.find('b') is not None else 'None' for item in data_points]
final_data = dict(zip(header_data, detail_data))
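
The label/value pairing can be sketched as a self-contained script. This is an illustration under assumptions: `data_points` is taken to be the result of `soup.find_all("td")`, the third `<td>` without a `<b>` tag is a made-up case to exercise the fallback, and stripping the trailing colon from the label is an extra tidy-up not in the snippet above.

```python
from bs4 import BeautifulSoup

# The two cells from the example above, plus one hypothetical cell
# with no <b> tag to show the 'None' fallback.
html_doc = (
    "<table><tr>"
    "<td>NDIC File No: <b>12584</b></td>"
    "<td>     API No: <b>33-007-01163-00-00</b></td>"
    "<td>No bold value here</td>"
    "</tr></table>"
)

soup = BeautifulSoup(html_doc, "html.parser")
data_points = soup.find_all("td")  # assumption: one <td> per data point

# Label: the first node inside each <td>, i.e. the text before the <b> tag.
header_data = [
    str(td.next_element).strip().rstrip(":") for td in data_points
]

# Value: the text of the nested <b> tag, or the string 'None' if it is missing.
detail_data = [
    td.find("b").get_text(strip=True) if td.find("b") is not None else "None"
    for td in data_points
]

# zip() pairs the i-th label with the i-th value.
final_data = dict(zip(header_data, detail_data))
print(final_data)
```

Both lists are built from the same `data_points` in the same order, which is what keeps each label matched to its own value.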
