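Background

  • Before analyzing URL logs or running algorithms over them, the URLs usually need to be featurized and normalized. This article demonstrates a few urlparse-related libraries and methods.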

Parsing a URL's query with urlparse and building a dictionary

    import urlparse  # Python 2 module

    url = "http://www.example.org/b.html?a=1&b=2#abc=a"
    print(urlparse.urlsplit(url))
    print(urlparse.parse_qs(urlparse.urlsplit(url).query))
    print(dict(urlparse.parse_qsl(urlparse.urlsplit(url).query)))

    # Output:
    # SplitResult(scheme='http', netloc='www.example.org', path='/b.html', query='a=1&b=2', fragment='abc=a')
    # {'a': ['1'], 'b': ['2']}
    # {'a': '1', 'b': '2'}

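Notes:

  1. In Python 3, urlparse has been moved to urllib.parse.

  2. urlparse provides two functions, urlparse.parse_qs() and urlparse.parse_qsl(), both of which parse the query component of a URL. When the same key appears with several values in the query, urlparse.parse_qs() collects all of that key's values into one list. urlparse.parse_qsl() returns a list of (key, value) pairs instead, so passing its result to dict() keeps only the last value of a repeated key.

  3. Still to be tested when time allows: how the server side should receive a query in which the same key carries multiple values.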
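As a minimal sketch of notes 1 and 2 under Python 3, the snippet below imports the same functions from urllib.parse and parses a query in which the key a appears twice. The URL with the repeated key is an assumption made for illustration; it is a variation of the example above, not part of the original demo.

```python
from urllib.parse import urlsplit, parse_qs, parse_qsl

# Hypothetical URL: the key "a" appears twice in the query.
url = "http://www.example.org/b.html?a=1&a=3&b=2#abc=a"
query = urlsplit(url).query

# parse_qs groups every value of a repeated key into one list.
grouped = parse_qs(query)
print(grouped)  # {'a': ['1', '3'], 'b': ['2']}

# parse_qsl keeps each (key, value) pair in its original order.
pairs = parse_qsl(query)
print(pairs)  # [('a', '1'), ('a', '3'), ('b', '2')]

# dict() silently keeps only the last value of a repeated key.
flat = dict(pairs)
print(flat)  # {'a': '3', 'b': '2'}
```

If repeated keys matter for the features being built, parse_qs (or the raw pair list) is probably the safer choice, since the dict() shortcut drops the earlier values without warning.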

URL decoding

    >>> import urlparse
    >>> from urlparse import unquote
    >>> url = "http://www.google.com/support/contact/bin/request.py?entity=%7B%22author%22:%22AIe9_BEW4fia2hKVVTrlUwNzhLS-jMdh3isj0rMd7_Cw85R1-YlRNFkUwoDyhH4aMj7AdHsW5A1po8BinbxspAuLBdB-or_3YzCMNXZKYrb50MIIJCZEpb4%22,%22groups%22:%5B%22general%22,%2254296%7C700726330%22%5D,%22trustedMerchantId%22:%22MID_54316%22%7D&client=242&contact_type=anno&hl=en_US"
    >>> a = urlparse.urlparse(url).query
    >>> b = unquote(a)  # percent-decode the query string
    >>> b
    'entity={"author":"AIe9_BEW4fia2hKVVTrlUwNzhLS-jMdh3isj0rMd7_Cw85R1-YlRNFkUwoDyhH4aMj7AdHsW5A1po8BinbxspAuLBdB-or_3YzCMNXZKYrb50MIIJCZEpb4","groups":["general","54296|700726330"],"trustedMerchantId":"MID_54316"}&client=242&contact_type=anno&hl=en_US'
    >>> import HTMLParser
    >>> html_parser = HTMLParser.HTMLParser()
    >>> txt = html_parser.unescape(b)  # decode HTML entities, if any
    >>> txt
    u'entity={"author":"AIe9_BEW4fia2hKVVTrlUwNzhLS-jMdh3isj0rMd7_Cw85R1-YlRNFkUwoDyhH4aMj7AdHsW5A1po8BinbxspAuLBdB-or_3YzCMNXZKYrb50MIIJCZEpb4","groups":["general","54296|700726330"],"trustedMerchantId":"MID_54316"}&client=242&contact_type=anno&hl=en_US'
    >>> c = urlparse.parse_qsl(txt, True)  # True: keep blank values
    >>> c  # c is a list
    [(u'entity', u'{"author":"AIe9_BEW4fia2hKVVTrlUwNzhLS-jMdh3isj0rMd7_Cw85R1-YlRNFkUwoDyhH4aMj7AdHsW5A1po8BinbxspAuLBdB-or_3YzCMNXZKYrb50MIIJCZEpb4","groups":["general","54296|700726330"],"trustedMerchantId":"MID_54316"}'), (u'client', u'242'), (u'contact_type', u'anno'), (u'hl', u'en_US')]
    >>> import json
    >>> c = dict(c)
    >>> d = json.loads(c['entity'])  # the entity value is itself JSON
    >>> d
    {u'trustedMerchantId': u'MID_54316', u'groups': [u'general', u'54296|700726330'], u'author': u'AIe9_BEW4fia2hKVVTrlUwNzhLS-jMdh3isj0rMd7_Cw85R1-YlRNFkUwoDyhH4aMj7AdHsW5A1po8BinbxspAuLBdB-or_3YzCMNXZKYrb50MIIJCZEpb4'}
    >>> print d['groups'][-1]
    54296|700726330
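The same decoding steps can be sketched for Python 3 roughly as follows. The URL here is a shortened, hypothetical variant of the one above (host and parameter set are assumptions chosen to keep the example readable). Two details differ from the Python 2 session: html.unescape stands in for HTMLParser.unescape, which no longer exists in recent Python 3 releases, and there is no separate unquote pass, because parse_qsl already percent-decodes keys and values.

```python
import html
import json
from urllib.parse import urlsplit, parse_qsl

# Hypothetical, shortened URL whose "entity" parameter holds percent-encoded JSON.
url = (
    "http://www.example.org/request.py"
    "?entity=%7B%22groups%22:%5B%22general%22,%2254296%7C700726330%22%5D,"
    "%22trustedMerchantId%22:%22MID_54316%22%7D"
    "&client=242&hl=en_US"
)

# parse_qsl splits on "&" and "=" first, then percent-decodes each part.
params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))

# Decode HTML entities, if any (a no-op for this particular value).
entity_text = html.unescape(params["entity"])

# The decoded value is a JSON document, so it can be loaded directly.
entity = json.loads(entity_text)
print(entity["groups"][-1])  # 54296|700726330
```

Decoding after splitting, rather than before, also avoids misparsing values that contain an encoded & or =.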

Computing a hash

    from hashlib import md5

    url = "http://wxapp.1688.com/wx/offer/576822531818?app_id=e88kTif9Bs&_app_id=e88kTif9Bs&session_key="
    # Python 2: str is already a byte string, so it can be hashed directly.
    print(md5(url).hexdigest())
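Under Python 3, md5() accepts only bytes, so the string has to be encoded first. The sketch below assumes UTF-8 as the encoding, and url_md5 is a hypothetical helper name introduced only for this illustration.

```python
from hashlib import md5


def url_md5(url: str) -> str:
    """Return the hex MD5 digest of a URL (hypothetical helper, assumes UTF-8)."""
    return md5(url.encode("utf-8")).hexdigest()


url = "http://wxapp.1688.com/wx/offer/576822531818?app_id=e88kTif9Bs&_app_id=e88kTif9Bs&session_key="
digest = url_md5(url)
print(digest)  # a 32-character hexadecimal string
```

The digest is deterministic, so it can serve as a compact key for deduplicating or grouping URLs; MD5 should not be relied on for anything security-sensitive.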