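Looking Up Word Pronunciations in a Scanned PDF

Posted on 2023-06-14 | Updated on 2024-01-08 | Category: Life

Using programming skills to solve problems I run into in everyday life.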

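The original idea: I wanted to build up my English vocabulary, but the traditional way of memorizing word lists is too tedious, so I wondered whether I could memorize words by category instead. Words in the same category are more closely related to each other, which makes them easier to remember.

While searching online I found a book, 《老外每天在用的生活词汇》 (roughly, "Everyday Vocabulary That Native Speakers Actually Use"), that matched the idea exactly. Great, someone had already sorted the words into categories for me. Unfortunately I could only find a scanned PDF, and the drawback of book content is that it has no pronunciation and no example sentences. I did find the book's companion MP3 audio online, but it only covers pronunciation and has to be played chapter by chapter in order, like a cassette tape.

Then I thought of a browser extension I use a lot, 沙拉查词 (Saladict). If only this PDF were online and clicking a word popped up a dialog! I could check the pronunciation and example sentences, and the content could be richer too. What a learning experience that would be; the words would surely stick faster!

All right, decided. Memorizing words can wait; first we sharpen the axe.

Turning that idea into reality is not simple, so let's break it down: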

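How to get the scanned PDF onto a web page:
- Images
- OCR to convert it to text
Since we want to keep the book's original layout and illustrations, images are the only choice;
OCR can handle the words that appear on each page;
How do we determine where each word is? After some investigation (don't underestimate that casual phrase), it turns out OCR can do this!
Done! Once the words and their positions are known, the rest is easy; HTML and popups are minor details.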

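OK, next step: find a good open-source OCR library and try it out.

After some searching on GitHub, I found tesseract, which had comparatively the most stars. It can be run from the command line locally and can also be called through an API, so naturally I wanted to drive it from a language I know; C++ is beyond me. Sure enough, I found node-tesseract-ocr. Hahaha, the heavens are on my side~

After reading the docs for both libraries, let's write a simple index.js to test the results!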

const Tesseract = require('node-tesseract-ocr');
const fs = require('fs');

const config = {
  lang: 'eng', // recognition language
  oem: 1, // OCR engine mode
  psm: 3, // page segmentation mode
  tessedit_create_hocr: '1' // produce hOCR output
};

// Read the image file
const image = fs.readFileSync('test001.png');

// Call the OCR API to run recognition
Tesseract.recognize(image, config)
  .then((res) => {
    // Save the recognition result as an HTML file
    fs.writeFileSync('result.html', res);
    console.log('OCR result saved as result.html');
  })
  .catch(error => {
    console.error(error);
  });

Run this file with node and we get an HTML file whose code looks like this:

<div class='ocr_carea'
      id='block_1_5'
      title="bbox 72 558 427 581">
  <p class='ocr_par'
      id='par_1_3'
      lang='eng'
      title="bbox 72 558 427 581">
      <span class='ocrx_word'
            id='word_1_7'
            title='bbox 103 564 148 577; x_wconf 95'
            style="position: absolute; top: 564px; left: 103px;background-color: yellow;">taste</span>
      <span class='ocrx_word'
            id='word_1_8'
            title='bbox 156 561 258 581; x_wconf 67'
            style="position: absolute; top: 561px; left: 156px;background-color: yellow;">good/bad</span>
  </p>
</div>

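This is not the full output. The basic pattern of the main markup is: several div elements with class ocr_carea, each containing some p elements with class ocr_par, which in turn contain some span elements with class ocrx_word.

The title attribute on each span holds the text's bounding box; the style attribute is something I added to test where that box lands (the yellow areas in the screenshot below). Testing the position is not strictly necessary, because any offset applies to all of the text equally and can be corrected in one go. The screenshot below shows the result after that uniform adjustment. It looks pretty good and meets our needs.

OK, a milestone reached ✌

One more problem: tesseract's output is too fragmented and needs filtering. We only want complete words! I wonder whether a regex can do it 🤔 That's beyond me too, so let's have ChatGPT write one!

Let's improve the code from before by adding the regex ChatGPT wrote. After parsing the DOM, we filter the text inside it and delete any entries that don't qualify!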
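
The title attribute packs several properties into one string, separated by semicolons. Here is a minimal sketch of pulling them apart, assuming the format shown in the snippet above. `parseHocrTitle` and `parseBbox` are names made up for this illustration; they are not part of tesseract or node-tesseract-ocr.

```javascript
// Hypothetical helper: split an hOCR title such as
// "bbox 103 564 148 577; x_wconf 95" into named fields.
function parseHocrTitle(title) {
  const props = {};
  for (const part of title.split(';')) {
    const [key, ...values] = part.trim().split(/\s+/);
    if (key) props[key] = values.join(' ');
  }
  return props;
}

// Turn the "bbox" field into numbers: left, top, right, bottom (in pixels).
function parseBbox(bbox) {
  const [left, top, right, bottom] = bbox.split(' ').map(Number);
  return { left, top, right, bottom };
}

const props = parseHocrTitle('bbox 103 564 148 577; x_wconf 95');
const box = parseBbox(props.bbox);
console.log(props.x_wconf, box);
```

The first two bbox numbers are the left and top edges, which is all that absolute positioning needs; the last two would be useful if the highlight should also match the word's width and height.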

...
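    const traverse = node => {
      if (node.type === 'tag') {
        if (node.name === 'span' && node.attribs.class === 'ocrx_word') {
          const word = node.children.filter(child => child.type === 'text')[0].data;
          // Keep tokens that contain English letters and are longer than 2 characters;
          // anything else is removed. Note the pattern is unanchored, so it passes
          // any token containing at least one letter.
          if (/[a-z]+[-']?[a-z]*/i.test(word) && word.length > 2) {
            lines.push(word)
            const titleArr = node.attribs.title.split(' ');
            // title looks like "bbox left top right bottom; ..."; subtract the offsets
            const top = titleArr[2] - 9;
            const left = titleArr[1] - 10;
            node.attribs.style = `position: absolute; top: ${top}px; left: ${left}px;background-color: skyblue;`
          } else {
            const index = node.parent.children.indexOf(node);
            node.parent.children.splice(index, 1); // remove this node from its parent
          }
        } else {
          // Iterate over a copy: a child may splice itself out of node.children,
          // which would otherwise make forEach skip the next sibling
          [...node.children].forEach(child => traverse(child));
        }
      }
    };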
...

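Let's look at the result 😀

Oh, nice! There is still a lot of garbage, but it's much better. Maybe the screenshot I tested with was too blurry. No matter, let's go back to the docs and optimize.

Later improvements on the recognition side:

Added a language data directory;
Special symbols are filtered out with a regex;
Only words recognized with a confidence above 80 are shown.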
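
One likely reason junk still slips through is that the pattern above is unanchored: it accepts any token that contains at least one letter. A stricter variant, offered as a sketch and not what the post's code does, anchors the pattern so the whole token has to look like a word. `isPlainWord` is a hypothetical name.

```javascript
// Hypothetical stricter filter: the whole token must be letters, optionally
// joined by single hyphens or apostrophes (e.g. "well-done", "don't").
const WORD_RE = /^[a-z]+(?:[-'][a-z]+)*$/i;

function isPlainWord(token) {
  return token.length > 2 && WORD_RE.test(token);
}

console.log(['taste', "don't", 'good/bad', '@#ab', 'ok'].map(isPlainWord));
// → [ true, true, false, false, false ]
```

The trade-off is that compound entries such as "good/bad" are dropped as a whole; splitting tokens on "/" before filtering would be one way to keep both halves.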

...
const config = {
  lang: 'eng', // recognition language
  oem: 1, // OCR engine mode
  psm: 3, // page segmentation mode
  tessedit_create_hocr: '1', // produce hOCR output
  tessdata: './tessdata' // language data directory
};
...
const traverse = node => {
  if (node.type === 'tag') {
    if (node.name === 'span' && node.attribs.class === 'ocrx_word') {
      const word = node.children.filter(child => child.type === 'text')[0].data;
      // parseStrToObj (not shown) turns the title string into an object
      const titleObj = parseStrToObj(node.attribs.title);
      const { bbox, x_wconf } = titleObj;

      // Keep tokens that contain English letters, are longer than 2 characters,
      // and have a confidence above 80; anything else is removed
      if (/[a-z]+[-']?[a-z]*/i.test(word) && word.length > 2 && Number(x_wconf) > 80) {
        const newWord = word.replace(/@/g, '');
        lines.push(newWord)

        const index = node.parent.children.indexOf(node);
        node.parent.children[index].children[0].data = newWord;
        // Convert the pixel coordinates into percentages of the image size
        const top = bbox.split(' ')[1] / imgHeight * 100;
        const left = bbox.split(' ')[0] / imgWidth * 100;
        node.attribs.style = `display: inline-block; position: absolute; top: ${top}%; left: ${left}%;background-color: skyblue;padding: 0 3px;border-radius: 3px;border: 1px solid #F56C6C`;
      } else {
        const index = node.parent.children.indexOf(node);
        node.parent.children.splice(index, 1); // remove this node from its parent
      }
    } else {
      // Iterate over a copy: a child may splice itself out of node.children,
      // which would otherwise make forEach skip the next sibling
      [...node.children].forEach(child => traverse(child));
    }
  }
};
...

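Yep, much better 🎉!

That's it for recognition for now. The current results are good enough for our needs, so let's move on.

Every book has many page images. For batch processing we just read them with the fs module and loop over them; the result is a set of HTML files.

So how do we turn that into something book-like? The usual way: embed the HTML in an iframe, add a table of contents on the left, and change the iframe's src when an entry is clicked.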
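
Positioning by percentage instead of fixed pixel offsets should keep the highlights aligned when the page image is scaled, provided the highlights sit in a container that matches the image's size. A small sketch of that conversion on its own; `bboxToPercent` is a hypothetical helper, and the 1000 × 1400 image size is made up for the example.

```javascript
// Hypothetical helper: convert a pixel bbox string ("left top right bottom")
// into CSS percentages relative to the page image size.
function bboxToPercent(bbox, imgWidth, imgHeight) {
  const [left, top] = bbox.split(' ').map(Number);
  return {
    left: (left / imgWidth) * 100,
    top: (top / imgHeight) * 100,
  };
}

const pos = bboxToPercent('250 700 298 717', 1000, 1400);
console.log(`top: ${pos.top}%; left: ${pos.left}%;`);
// → top: 50%; left: 25%;
```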
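
A rough sketch of the listing step, assuming the page images sit in one folder and are named by page number (1.png, 2.png, and so on). The helper name `buildPageList` and the folder layout are assumptions for illustration; the `label`/`value` shape follows what the menu script later in this post reads. The OCR call itself is left out.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: list the page images in a folder, sort them by page
// number, and map each one to the HTML file it will be rendered into.
function buildPageList(imageDir, outputDir = './output') {
  return fs.readdirSync(imageDir)
    .filter(name => /\.(png|jpe?g)$/i.test(name))
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10)) // numeric, so 10 follows 9
    .map(name => {
      const page = path.parse(name).name;
      return {
        label: `Page ${page}`,
        value: `${outputDir}/${page}.html`,
        image: path.join(imageDir, name),
      };
    });
}

// Demo with a throwaway folder of empty files standing in for the scans.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'pages-'));
['10.png', '2.png', '1.png', 'notes.txt'].forEach(file => {
  fs.writeFileSync(path.join(demoDir, file), '');
});
const pages = buildPageList(demoDir);
console.log(pages.map(p => p.value));
```

Each entry's `image` path would then be passed to the OCR step, and the processed markup written to the matching `value` path.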

...
<body>
  <div class="page-wrap">
    <ul id="menu"
        class="menu"></ul>
    <div class="content">
      <iframe id="myIframe"
              src="./output/1.html"
              frameborder="0"
              style="width: 100%;height: 100%;"></iframe>
    </div>
  </div>

  <script src="./script.js"></script>
</body>
</html>

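The CSS is extremely simple, so I won't show it. The page looks roughly like the screenshot below. It's a bit ugly, but let's get it working first.

The HTML above pulls in a script.js file. To automate things, we can rewrite the contents of script.js once image recognition has finished: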

// Replace the contents of script.js with the generated JavaScript code
function setScript () {
  const newScript = `
  const menu = document.querySelector('#menu');
  const iframe = document.querySelector('#myIframe');
  const items = ${JSON.stringify(outPutFiles)}

  // Build the menu items dynamically
  items.forEach(item => {
    const li = document.createElement('li');
    li.textContent = item.label;
    li.addEventListener('click', () => {
      iframe.src = item.value;
    });
    menu.appendChild(li);
  });
  `;
  // Write the generated code back to script.js
  fs.writeFileSync('./script.js', newScript, 'utf8');
  console.log('script.js updated')
}

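The outPutFiles variable holds the data for the table of contents on the left.

At this point the main functionality is basically done. We can now flip through pages and use Saladict to select a word and hear its pronunciation 🎉

It's still a bit rough, but we can polish it later. I'm off to memorize some words first 🙃

Code: maqingbo/OCR-Ebook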