
Original tutorial; reproduction is prohibited. When citing this article, please credit Python中文网, http://www.zglg.work


The first Python mini-project

Key Word In Context (KWIC) is the most common format for concordance lines.

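Project description: given a series of sentences and a target word that occurs at least once in every sentence, output the target word in KWIC format. The basic requirements of a KWIC display: the queried word is centered, the preceding (pre) sequence is right-aligned, the following (post) sequence is left-aligned, and the pre and post sequences have equal length; if an input sentence cannot satisfy this, pad with spaces.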

Input parameters: the input sentences sentences, the queried word selword, and the sliding-window length window_len

For example, for six input sentences and the given word secure, the output is the following string:

               pre keyword    post

     welfare , and secure  the blessings of
     nations , and secured immortal glory with
       , and shall secure  to you the
    cherished . To secure  us against these
     defense as to secure  our cities and
          I can to secure  economy and fidelity

Complete the implementation of the following function:

def kwic(sentences: List[str], selword: str, window_len: int) -> str:
    """
    :param sentences: input sentences
    :param selword: selected word
    :param window_len: window length
    :return: the KWIC display as a single string
    """

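One possible way to fill in this stub is sketched below; it is not the project's published solution. It rests on several assumptions that the task statement does not fix: tokens are split on whitespace, a token matches when it starts with selword (so that "secured" matches "secure"), and window_len counts the tokens kept on each side of the keyword. The two sample inputs are simply the first and last output lines from the example above.

```python
from typing import List


def kwic(sentences: List[str], selword: str, window_len: int) -> str:
    """Return one KWIC line per sentence that contains selword.

    Assumptions (not from the original project): whitespace tokenization,
    case-insensitive prefix matching, window_len tokens on each side.
    """
    rows = []
    for sentence in sentences:
        tokens = sentence.split()
        # Index of the first token that starts with the selected word.
        index = next((i for i, tok in enumerate(tokens)
                      if tok.lower().startswith(selword.lower())), -1)
        if index == -1:
            continue  # skip sentences without the keyword
        pre = " ".join(tokens[max(0, index - window_len):index])
        post = " ".join(tokens[index + 1:index + 1 + window_len])
        rows.append((pre, tokens[index], post))

    if not rows:
        return ""
    # Column widths come from the longest pre context and the longest keyword.
    pre_width = max(len(pre) for pre, _, _ in rows)
    key_width = max(len(key) for _, key, _ in rows)
    lines = [f"{pre.rjust(pre_width)} {key.ljust(key_width)} {post}".rstrip()
             for pre, key, post in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        "welfare , and secure the blessings of",
        "I can to secure economy and fidelity",
    ]
    print(kwic(sample, "secure", 3))
```

Right-aligning the pre column with rjust and left-aligning the keyword with ljust is what lines the keyword up vertically; shorter contexts are padded with spaces, as the requirements ask.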
For more examples of KWIC displays, see:

http://dep.chs.nihon-u.ac.jp/english_lang/tukamoto/kwic_e.html

The complete code and analysis for this project are published on Python中文网

All the code below has been tested and runs as given, though some errors are probably unavoidable. Corrections are welcome at: https://github.com/jackzhenguo/python-small-examples/issues

# encoding: utf-8
"""
@file: kwic_service.py
@desc: providing functions about KWIC presentation
@author: group3
@time: 5/9/2021
"""

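import logging
import re
from typing import List

# logger used by kwic_show when align_param is too small
log = logging.getLogger(__name__)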

Get the window around the keyword sel_word; the default window length is 5

def get_keyword_window(sel_word: str, words_of_sentence: List[str], length: int = 5) -> List[str]:
    """
    Find the index of sel_word in the sentence, then pick a window of @length
    words around it by looking backward and forward.
    For example, for: I am very happy to this course of psd
    with sel_word happy and length 5, this returns: [am, very, happy, to, this]

    If length is even (here 4), this returns [very, happy, to, this]

    Note: sel_word is expected to be a word root
    """
    if length <= 0 or len(words_of_sentence) <= length:
        return words_of_sentence
    index = -1
    for iw, word in enumerate(words_of_sentence):
        word = word.lower()
        # escape sel_word so regex metacharacters in it are matched literally
        if re.search(re.escape(sel_word.lower()), word):
            index = iw
            break

    if index == -1:
        # log.warning("warning: cannot find %s in sentence: %s" % (sel_word, words_of_sentence))
        return words_of_sentence
    # backward is not enough
    if index < length // 2:
        back_slice = words_of_sentence[:index]
        # forward is also not enough,
        # showing the sentence is too short compared to length parameter
        if (length - index) >= len(words_of_sentence):
            return words_of_sentence
        else:
            return back_slice + words_of_sentence[index: index + length - len(back_slice)]
    # forward is not enough
    if (index + length // 2) >= len(words_of_sentence):
        forward_slice = words_of_sentence[index:len(words_of_sentence)]
        # backward is also not enough,
        # showing the sentence is too short compared to length parameter
        if index - length <= 0:
            return words_of_sentence
        else:
            return words_of_sentence[index - (length - len(forward_slice)):index] + forward_slice

    return words_of_sentence[index - length // 2: index + length // 2 + 1] if length % 2 \
        else words_of_sentence[index - length // 2 + 1: index + length // 2 + 1]
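The final return statement picks a slice centered on the keyword. As a quick check of that slice arithmetic against the docstring's example, here is a small sketch; the helper name centered_window is hypothetical, and it covers only the case where both sides of the keyword have enough words (the boundary cases are handled earlier in the function).

```python
from typing import List


def centered_window(words: List[str], index: int, length: int) -> List[str]:
    """Slice `length` words around words[index], assuming enough words
    exist on both sides of the keyword."""
    half = length // 2
    if length % 2:
        # Odd length: `half` words on each side of the keyword.
        return words[index - half: index + half + 1]
    # Even length: one word fewer on the left than on the right.
    return words[index - half + 1: index + half + 1]


words = "I am very happy to this course of psd".split()
index = words.index("happy")
print(centered_window(words, index, 5))  # ['am', 'very', 'happy', 'to', 'this']
print(centered_window(words, index, 4))  # ['very', 'happy', 'to', 'this']
```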

KWIC display logic:

def kwic_show(sel_language, words_of_sentence, sel_word, window_size=9, align_param=70, token_space_param=1):
    """return the kwic string for words_of_sentence, with sel_word as the key token
    :param sel_language: selected language (not used by the current logic)
    :param words_of_sentence: all words in one sentence
    :param sel_word: key token
    :param window_size: size of kwic window, in words
    :param align_param: total display width used to align the output
    :param token_space_param: number of spaces before and after the keyword
    :return: (kwic string, padded left context); (None, None) when sel_word
        is not found or align_param is too small

    changing the default values of window_size and align_param is not recommended
    """
    if window_size < 1:
        return None, None
    if window_size >= len(words_of_sentence):
        window_size = len(words_of_sentence)

    words_in_window = get_keyword_window(sel_word, words_of_sentence, window_size)

    sent = ' '.join(words_in_window)
    # TODO: better to use token after lemmatization to sel_word
    try:
        key_index = sent.lower().index(sel_word.lower())
    except ValueError:
        # log.warning('%s not in sentence %s' % (sel_word, sent))
        key_index = -1
    if key_index == -1:
        return None, None

    align_param = align_param - len(sel_word) - 2 * token_space_param
    if align_param < 0:
        log.warning('align_param must be at least the keyword length plus its padding')
        return None, None
    pre_part = sent[:key_index].rstrip()
    # dealing with the problem of too long string on the left side of keyword
    i, n_pre_words = 1, len(pre_part.split(' '))
    while i < n_pre_words and len(pre_part) > align_param // 2:
        pre_words = pre_part.split(' ')
        # drop only the leftmost word (the one farthest from the keyword) on each pass
        pre_words = pre_words[1:]
        pre_part = " ".join(pre_words)
        i += 1

    pre_kwic = pre_part.rjust(align_param // 2)
    key_kwic = token_space_param * ' ' + sent[key_index: key_index + len(sel_word)].lstrip() + token_space_param * ' '

    # dealing with the problem of too long string on the right side of keyword
    post_kwic = sent[key_index + len(sel_word):].lstrip()
    n_post_words = len(post_kwic.split(' '))
    i = n_post_words - 1
    while i > 0 and len(post_kwic) > align_param // 2:
        post_kwic_words = post_kwic.split(' ')
        post_kwic_words = post_kwic_words[:i]
        post_kwic = " ".join(post_kwic_words)
        i -= 1

    sel_word_kwic = pre_kwic + key_kwic + post_kwic
    return sel_word_kwic, pre_kwic
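The two while loops in kwic_show shorten each side by whole words until it fits into half of the remaining width. The left-hand idea can be sketched in isolation as follows; the function name trim_left_context and the sample text are made up for illustration, and the sketch drops exactly one leading word per step.

```python
def trim_left_context(text: str, width: int) -> str:
    """Drop leading words until `text` fits in `width` characters,
    always keeping at least the word closest to the keyword."""
    words = text.split()
    while len(words) > 1 and len(" ".join(words)) > width:
        words = words[1:]  # discard the word farthest from the keyword
    return " ".join(words)


pre = trim_left_context("promote the general welfare , and", 15)
print(repr(pre.rjust(15)))  # '  welfare , and'
```

Trimming by whole words keeps the display from cutting a word in half; rjust then pads whatever remains so the keyword column stays aligned.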

Test code

# encoding: utf-8
"""
@file: test_kwic_show.py
@desc:
@author: group3
@time: 5/3/2021
"""
from src.feature.kwic import kwic_show

if __name__ == '__main__':
    words = ['I', 'am', 'very', 'happy', 'to', 'this', 'course', 'of', 'psd']

    print(kwic_show('English', words, 'I', window_size=1)[0])
    print(kwic_show('English', words, 'I', window_size=5)[0])

    print(kwic_show('English', words, 'very', token_space_param=5)[0])
    print(kwic_show('English', words, 'very', window_size=6, token_space_param=5)[0])
    print(kwic_show('English', words, 'very', window_size=1, token_space_param=5)[0])

    # test boundary
    print(kwic_show('English', words, 'stem', align_param=20)[0])
    print(kwic_show('English', words, 'stem', align_param=100)[0])
    print(kwic_show('English', words, 'II', window_size=1)[0])
    print(kwic_show('English', words, 'related', window_size=10000)[0])

Printed output

                                  I 
                                  I am very happy to
                        I am     very     happy to this course of psd
                        I am     very     happy to this
                                 very     
None
None
None
None
