Overview
The notebook is split into studio and work. work is empty; studio contains three notes: 爬虫学 (scraping studies), Feedbot, and Misskey.py.
Misskey.py: misskey.py is a Python wrapper for the Misskey API that makes it easy to call the API from scripts. Because the field name for file uploads differs on Firefish, I wanted to pull it down, modify it, and repackage it as firefish.py, but the test failed; I have not yet worked out how to package and debug a pip library. This note can be discarded.
Feedbot: small implementations of a few bot scripts; some fragments are worth keeping.
爬虫学 (scraping studies): notes from learning web scraping, following Zotero - Library - Programming - 《Python网络爬虫权威指南(第2版)》 (Web Scraping with Python, 2nd Edition).
Scraping notes (scraping_note)
Knowledge in the world is vast and tangled. If one principle had to rank it, I think "learning for practical use" (经世致用) fits best: the ultimate purpose of knowing is doing. Most of what most people learn is lost once the learning phase ends, and treating study as a purely intellectual activity is a real waste; it could serve so many more purposes, and not applying it to something concrete is little short of squandering it. Some practice is hard and some is easy; sometimes what you see is what you get, and sometimes it is not. The essence of labor is the moving and reworking of material, and scraping, as a form of labor, is worth studying. Below I use scraping as the example for learning through application.
Key notes
Beautiful Soup makes it very convenient to extract information from a page.
You need to be familiar with HTML and the DOM to use the tool efficiently.
Parsers
Python offers several parsers (a comparison sketch follows the list):
html.parser: the common built-in parser
lxml: installed separately; tolerant of structurally messy HTML
html5lib: even more fault-tolerant, but slower
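A minimal sketch of how the parser is chosen, assuming lxml and html5lib have been installed separately (`pip install lxml html5lib`); the page fetched is only the same test URL used further below, purely for comparison:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://blog.si-on.top/about/').read()

# Same interface, different parser backends
soup_builtin = BeautifulSoup(html, 'html.parser')  # ships with the standard library
soup_lxml = BeautifulSoup(html, 'lxml')            # fast, tolerant of messy markup
soup_html5 = BeautifulSoup(html, 'html5lib')       # most tolerant, slowest

print(soup_builtin.h1, soup_lxml.h1, soup_html5.h1)
```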
Exception handling
Handle the various failure cases so the script does not simply produce nothing and leave you stuck.
HTTPError: the page does not exist on the server
URLError: the server itself cannot be found
```python
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen('https://blog.si-on.top/about/')
except HTTPError as e:
    print("报错啦:", e)
except URLError as e:
    print("服务器找不着了!")
else:
    bs = BeautifulSoup(html.read(), 'html.parser')
    testtag = bs.h1
    print(testtag)
```
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://sunsetbot.top/map/')
bs = BeautifulSoup(html.read(), 'html.parser')

img_tag = bs.find('img', {'id': 'map_img_src'})
if img_tag:
    img_src = img_tag.get('src')
    firemap = "https://sunsetbot.top" + img_src
else:
    print("未找到指定的图片标签")
```
Any node in an HTML (or XML) document can be extracted, as long as there is a tag next to or near the target information.
Locating information (tags)
Scraping with bs4 is essentially a filtering process. Most modern pages carry rich CSS (cascading style sheets), and the CSS class and id attributes are enough to filter out most of the tags you want.
find_all(tag, attributes, recursive, text, limit, keywords)
tag: the tag name(s) to match. For a single tag just pass the name, e.g. find_all('div'); for several tags pass a list, e.g. find_all(['div1','div2','div3']).
attributes: match attributes and their values on the tag. For example, scraping the equipment tags from the "玩物" (toys) page of 子虚栈 (this blog):
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://blog.si-on.top/AwesomeHub/equipment/')
bs = BeautifulSoup(html.read(), 'html.parser')

# Filter by CSS class: every div whose class matches "equipment"
equipment_list = bs.find_all('div', ["equipment"])
print(equipment_list)

print("测试2")
# Add a text filter on top: only equipment names containing "黑"
black_list = bs.find_all('div', ["icat-equipment-name"],
                         string=lambda text: text and '黑' in text)
for tag in black_list:
    print(tag.get_text())
```
Locating information (navigation trees)
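This heading has no notes yet. As a placeholder, here is a minimal sketch of the kind of tree navigation it presumably refers to (children, siblings, parents), using made-up markup rather than a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page
html = """
<table id="gift-list">
  <tr><th>品名</th><th>价格</th></tr>
  <tr class="gift"><td>茶杯</td><td>12.00</td></tr>
  <tr class="gift"><td>折扇</td><td>8.50</td></tr>
</table>
"""
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'gift-list'})

# children: direct children of the table (the rows), not all descendants
rows = [child for child in table.children if child.name == 'tr']
print(len(rows))                                # 3

# siblings: every row after the header row
header = table.find('tr')
for sibling in header.find_next_siblings('tr'):
    print(sibling.get_text(' ', strip=True))    # "茶杯 12.00" / "折扇 8.50"

# parent: walk back up from a cell to the enclosing table
cell = table.find('td', string='茶杯')
print(cell.find_parent('table').get('id'))      # "gift-list"
```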
Data storage
```python
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
from bs4 import BeautifulSoup

html = urlopen('https://sunsetbot.top/map/')
bs = BeautifulSoup(html.read(), 'html.parser')

img_tag = bs.find('img', {'id': 'map_img_src'})
img_re_src = img_tag.get('src')
encoded_path = quote(img_re_src, safe='/%')
firemap = "https://sunsetbot.top" + encoded_path
urlretrieve(firemap, '0828.jpg')
```
Quick test
```python
from misskey import Misskey
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
from bs4 import BeautifulSoup

mk = Misskey("https://si-on.top", i="******")

html = urlopen('https://sunsetbot.top/map/')
bs = BeautifulSoup(html.read(), 'html.parser')

img_tag = bs.find('img', {'id': 'map_img_src'})
img_re_src = img_tag.get('src')
encoded_path = quote(img_re_src, safe='/%')
firemap = "http://sunsetbot.top" + encoded_path
print(firemap)

# Upload a previously downloaded copy to the Misskey drive
with open('902.jpg', "rb") as f:
    data = mk.drive_files_create(f)
```
Automated information bots
A small countercurrent in the timeline, pushing back against the flood of the mainstream.
Feature requirements
Type 1: source-less bots (Bot) that scrape information and post it. Like a goose passing without a trace, each post is deleted automatically after a week; the information spreads and then dries up, with nothing worth keeping. (A cleanup sketch for the one-week rule follows the items below.)
赤翳♨️ (burning-cloud forecast): scrapes data and posts the burning-cloud (火烧云) outlook for sunrise and sunset every day.
♨️赤翳一隅

```python
from misskey import Misskey
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
import urllib.request
from urllib.error import HTTPError, URLError
import os
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import re

botkey = "****"
sunsettime = datetime.now().strftime("%Y年%m月%d日")
mk = Misskey("https://实例地址", i=botkey)

# Grab the current burning-cloud forecast map
html = urlopen('https://sunsetbot.top/map/')
bs = BeautifulSoup(html.read(), 'html.parser')

img_tag = bs.find('img', {'id': 'map_img_src'})
img_re_src = img_tag.get('src')
print(img_re_src)
encoded_path = quote(img_re_src, safe='/%')
firemap = "http://sunsetbot.top" + encoded_path
print(firemap)


def download_image(url, save_path='firecloud.jpg'):
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        req = urllib.request.Request(url, headers=headers)
        print(f"尝试下载: {url}")
        urllib.request.urlretrieve(
            url, save_path,
            lambda blocknum, blocksize, totalsize: print(
                f"下载进度: {blocknum * blocksize}/{totalsize if totalsize > 0 else '未知'}",
                end='\r'))
        print(f"\n文件已保存到: {save_path}")
        return True
    except HTTPError as e:
        print(f"HTTP错误 {e.code}: {e.reason}")
        print(f"请检查URL是否正确: {url}")
        return False
    except URLError as e:
        print(f"URL错误: {e.reason}")
        return False
    except Exception as e:
        print(f"其他错误: {str(e)}")
        return False


download_image(firemap)

# Upload the map to the drive, then post a note with it attached
with open('firecloud.jpg', "rb") as f:
    data = mk.drive_files_create(f)

fc = data['id']
print(fc)

note = mk.notes_create(
    text=f"""🌇{sunsettime} 火烧云预报来啦!
#预报 #一隅""".strip(),
    file_ids=[fc],
)
print(note)
```
雾笼🌫️ (fog forecast): scrapes data and posts the next day's fog outlook every evening.
🌫️雾笼一隅

```python
from misskey import Misskey
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
import urllib.request
from urllib.error import HTTPError, URLError
import os
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import re

botkey = "******"
fogtime = datetime.now().strftime("%Y年%m月%d日")
mk = Misskey("https://实例地址", i=botkey)

# Grab the national fog forecast map from NMC
html = urlopen('https://www.nmc.cn/publish/fog.html')
bs = BeautifulSoup(html.read(), 'html.parser')

img_tag = bs.find('img', {'id': 'imgpath'})
img_re_src = img_tag.get('src')
# Strip the '/medium' path segment to get the full-resolution image
img_raw = re.sub(r'/medium', '', img_re_src)
print(img_raw)


def download_image(url, save_path='fog.jpg'):
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        req = urllib.request.Request(url, headers=headers)
        print(f"尝试下载: {url}")
        urllib.request.urlretrieve(
            url, save_path,
            lambda blocknum, blocksize, totalsize: print(
                f"下载进度: {blocknum * blocksize}/{totalsize if totalsize > 0 else '未知'}",
                end='\r'))
        print(f"\n文件已保存到: {save_path}")
        return True
    except HTTPError as e:
        print(f"HTTP错误 {e.code}: {e.reason}")
        print(f"请检查URL是否正确: {url}")
        return False
    except URLError as e:
        print(f"URL错误: {e.reason}")
        return False
    except Exception as e:
        print(f"其他错误: {str(e)}")
        return False


download_image(img_raw)

with open('fog.jpg', "rb") as f:
    data = mk.drive_files_create(f)

fc = data['id']
print(fc)

note = mk.notes_create(
    text=f"""🌫️{fogtime} 雾预报来啦!
#预报 #一隅 #foggy #雾天""".strip(),
    file_ids=[fc],
)
print(note)
```
XKCD中文站: scrapes comics from the Chinese XKCD mirror, checking for updates at five o'clock and posting new ones.
XKCD抓取

```python
from misskey import Misskey
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import re
import os

botkey = "******"
mk = Misskey("https://实例地址", i=botkey)

html = urlopen('https://xkcd.in/')
bs = BeautifulSoup(html.read(), 'html.parser')

# The newest comic sits at the top of the strip list; its ID appears as "[数字]"
least_pic = bs.find('div', {'id': 'strip_list'})
if least_pic:
    first_link = least_pic.find('a')
    if first_link:
        text = first_link.get_text()
        match = re.search(r'\[(\d+)\]', text)
        if match:
            latest_id = match.group(1)
            print(f"最新文章ID: {latest_id}")
            pic_file = f"{latest_id}.png"
            # Use the local file as a marker: only post when the comic is new
            if not os.path.exists(pic_file):
                pic_main = bs.find('div', {'class': 'comic-body'})
                pic_link = 'https://xkcd.in/' + pic_main.find('img')['src']
                pic_title = pic_main.find('img')['title']
                pic_detail = bs.find('div', {'class': 'comic-details'}).get_text()
                urlretrieve(pic_link, pic_file)
                print(f"图片已下载: {pic_file}")

                with open(pic_file, "rb") as f:
                    data = mk.drive_files_create(f, name=pic_title)
                fc = data['id']
                print(fc)

                note = mk.notes_create(
                    text=f"""**{pic_title}**\n『*{pic_detail}*』\n#xkcd #漫画 #汉化 #一隅""".strip(),
                    file_ids=[fc],
                )
                print(note)
            else:
                print(f"图片已存在,跳过下载: {pic_file}")
```
~~双日一言 (美文阅读): scrape more2read and post formatted text every Monday, Wednesday, Friday, and Sunday.~~ 轻松调频 (CRI EZFM) stopped broadcasting on 2025.12.23.
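None of the scripts above implement the "delete after one week" rule yet. A rough sketch of how it might look, calling the Misskey endpoints users/notes and notes/delete directly over HTTP in the same style as the FirefishAPI wrapper further below; the bot's user ID and the exact parameter names are assumptions to verify against the instance:

```python
import time
import requests

INSTANCE = "https://实例地址"   # same placeholder instance as above
BOTKEY = "****"                # bot API token
BOT_USER_ID = "xxxxxxxx"       # hypothetical: the bot account's own user id


def api(path, payload):
    """POST to a Misskey-style endpoint with the token passed as 'i' in the body."""
    resp = requests.post(f"{INSTANCE}/api/{path}", json={"i": BOTKEY, **payload})
    resp.raise_for_status()
    return resp.json() if resp.text else None


def delete_old_notes(days=7, batch=30):
    cutoff_ms = int((time.time() - days * 86400) * 1000)
    # Fetch notes created before the cutoff, then delete them one by one
    old_notes = api("users/notes", {
        "userId": BOT_USER_ID,
        "untilDate": cutoff_ms,
        "limit": batch,
    })
    for note in old_notes:
        api("notes/delete", {"noteId": note["id"]})
        print("已删除:", note["id"])


if __name__ == "__main__":
    delete_old_notes()
```

This would run on the same schedule as the bots themselves (cron, or the schedule loop used further below).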
Type 1, standalone version: sourced bots that post from a curated database. The repeat cycle must be at least one quarter, and no two items posted within the same quarter may repeat within any two-quarter window. Classical-text bot (古文机器人): maintain a database of classical Chinese texts and post one piece every day.
Type 2: Turing bots (interactive).
RSS bot: when user A issues the command "/订阅" + "link", the bot adds the link to its subscription list and replies with the result, the bot's own RSS feed link, and a preview of two entries from the subscription. The bot then replies under that post with several polls (visible only to user A) letting A set the update frequency, the display style (tags, channels, etc.), and custom options (built-in variables); after each poll closes the bot renders a preview from the cached RSS, and once the style is confirmed it deletes all of its replies under the post. (A sketch of the command-parsing step follows this list.)
Automatic calculus bot: maintain a database of integration problems and post one daily.
Preference bot: scrapes local (instance) posts and reacts to those matching its configured personality and interests, boosting or commenting on them, and uses self-consistent data gathered online to simulate its own posts. Posting frequency and timing are constrained so as to imitate a real person as closely as possible.
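The RSS bot above is still only a specification. A small sketch of its first step, parsing a "/订阅 <链接>" command and caching a preview of the two latest entries with feedparser; the subscriptions.json file name and the reply format are made up here:

```python
import json
import re
import feedparser   # pip install feedparser

SUB_FILE = "subscriptions.json"   # hypothetical local subscription list
CMD_RE = re.compile(r'^/订阅\s+(\S+)')


def handle_subscribe(message_text):
    """Parse a '/订阅 <link>' command and return the reply text, or None."""
    match = CMD_RE.match(message_text.strip())
    if not match:
        return None
    url = match.group(1)

    feed = feedparser.parse(url)
    if feed.bozo and not feed.entries:
        return f"订阅失败,无法解析: {url}"

    # Persist the subscription locally
    try:
        with open(SUB_FILE, 'r', encoding='utf-8') as f:
            subs = json.load(f)
    except FileNotFoundError:
        subs = []
    if url not in subs:
        subs.append(url)
        with open(SUB_FILE, 'w', encoding='utf-8') as f:
            json.dump(subs, f, ensure_ascii=False, indent=2)

    # Preview of the two latest entries
    preview = "\n".join(f"· {e.title}\n  {e.link}" for e in feed.entries[:2])
    return f"已订阅: {feed.feed.get('title', url)}\n{preview}"


if __name__ == "__main__":
    print(handle_subscribe("/订阅 https://blog.si-on.top/atom.xml"))
```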
Firefish legacy code
Post the day's burning-cloud map (Firefish)
```python
from misskey import Misskey
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import re

botkey = "Hjknkjndfjksldfjsdfls"
sunsettime = datetime.now().strftime("%Y年%m月%d日")


class FirefishAPI:
    """Minimal wrapper around the Firefish HTTP API: the token is sent as 'i' in the JSON body."""

    def __init__(self, base_url, api_key=None, bearer_token=None):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key
        self.bearer_token = bearer_token

    def _headers(self):
        headers = {'Content-Type': 'application/json'}
        if self.bearer_token:
            headers['Authorization'] = f'Bearer {self.bearer_token}'
        return headers

    def _body(self, data):
        if self.api_key:
            data = dict(data or {})
            data['i'] = self.api_key
        return data

    def call(self, path, method='POST', data=None, files=None):
        url = f"{self.base_url}{path}"
        headers = self._headers()
        if files:
            response = requests.post(url, headers=headers, files=files, data=self._body(data))
        else:
            response = requests.request(method, url, headers=headers, json=self._body(data))
        response.raise_for_status()
        try:
            return response.json()
        except Exception:
            return response.text

    def notes_create(self, data):
        return self.call('/notes/create', data=data)


# File upload still goes through misskey.py's drive
mk = Misskey("https://si-on.top", i=botkey)

html = urlopen('https://sunsetbot.top/map/')
bs = BeautifulSoup(html.read(), 'html.parser')

img_tag = bs.find('img', {'id': 'map_img_src'})
img_re_src = img_tag.get('src')
encoded_path = quote(img_re_src, safe='/%')
firemap = "http://sunsetbot.top" + encoded_path
print(firemap)
urlretrieve(firemap, '904.jpg')

with open('904.jpg', "rb") as f:
    data = mk.drive_files_create(f)

fc = data['id']
print(fc)

if __name__ == "__main__":
    api = FirefishAPI(
        base_url="https://si-on.top/api",
        api_key=botkey,
    )
    note = api.notes_create({
        "text": f"""{sunsettime} 火烧云""".strip(),
        "lang": "zh-hans",
        "fileIds": [fc],
    })
    print(note)
```
Minimal send (Firefish)
```python
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
import requests
import re
import json
import time
import schedule
from datetime import datetime
import os

botkey = "Hjknkjndfjksldfjsdfls"


class FirefishAPI:
    def __init__(self, base_url, api_key=None, bearer_token=None):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key
        self.bearer_token = bearer_token

    def _headers(self):
        headers = {'Content-Type': 'application/json'}
        if self.bearer_token:
            headers['Authorization'] = f'Bearer {self.bearer_token}'
        return headers

    def _body(self, data):
        if self.api_key:
            data = dict(data or {})
            data['i'] = self.api_key
        return data

    def call(self, path, method='POST', data=None, files=None):
        url = f"{self.base_url}{path}"
        headers = self._headers()
        if files:
            response = requests.post(url, headers=headers, files=files, data=self._body(data))
        else:
            response = requests.request(method, url, headers=headers, json=self._body(data))
        response.raise_for_status()
        try:
            return response.json()
        except Exception:
            return response.text

    def notes_create(self, data):
        return self.call('/notes/create', data=data)


def load_book_data():
    """从aaa.json文件加载书籍数据"""
    try:
        with open('aaa.json', 'r', encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        print("错误: 未找到aaa.json文件")
        return None
    except json.JSONDecodeError:
        print("错误: aaa.json文件格式不正确")
        return None


def get_next_article_index():
    try:
        with open('article_index.txt', 'r') as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0


def save_article_index(index):
    with open('article_index.txt', 'w') as f:
        f.write(str(index))


def format_content(content_list):
    full_content = ''.join(content_list)
    first_char = full_content[0]
    formatted_content = f"$[x2 {first_char}]{full_content[1:]}"
    return formatted_content


def send_daily_article():
    book_data = load_book_data()
    if not book_data:
        print("无法加载书籍数据,请检查aaa.json文件")
        return

    api = FirefishAPI(
        base_url="https://si-on.top/api",
        api_key=botkey,
    )

    current_index = get_next_article_index()
    if current_index >= len(book_data["articles"]):
        current_index = 0
    article = book_data["articles"][current_index]
    formatted_content = format_content(article["content"])

    note_text = f"""
<center>**$[font.serif $[fg.color=67B7F7 {article['title']}]]**</center>
<center> $[font.serif {book_data['name']}]</center>

$[font.serif {formatted_content}]

#古文 #{book_data['name']}
"""

    try:
        note = api.notes_create({
            "text": note_text,
            "lang": "zh-hans",
        })
        print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} 已发送: {article['title']}")
        next_index = (current_index + 1) % len(book_data["articles"])
        save_article_index(next_index)
    except Exception as e:
        print(f"发送失败: {e}")
```
Scheduled-posting bot test (Firefish) (failed)
There is a GitHub repository containing JSON for many classical Chinese texts, but without paragraph breaks: everything is stored sentence by sentence, with regularly recurring spaces, apparently extracted from EPUB files.
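A rough sketch of how those sentence-level lists might be cleaned before feeding them to the bot: strip the recurring spaces and join sentences into fixed-size paragraphs (the paragraph size is arbitrary, since the original paragraph boundaries are lost in that JSON, and the input shape is only assumed to be a list of sentence strings):

```python
import re


def clean_sentences(sentences, per_paragraph=5):
    """Strip stray whitespace left over from EPUB extraction and
    group sentence-level strings into paragraphs of a few sentences each."""
    cleaned = [re.sub(r'\s+', '', s) for s in sentences if s.strip()]
    paragraphs = [
        ''.join(cleaned[i:i + per_paragraph])
        for i in range(0, len(cleaned), per_paragraph)
    ]
    return paragraphs


# Example with made-up input in the same sentence-per-item shape
raw = ["北冥有鱼, 其名为鲲。", "鲲之大, 不知其几千里也。", "化而为鸟, 其名为鹏。"]
print(clean_sentences(raw, per_paragraph=2))
```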
```python
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote
import requests
import re
import json
import time
import schedule
from datetime import datetime

botkey = "asfnjksdfbfbsdfsjkd52454fsd54f5sd4f5s"

# Inline test data. The inner quotation marks must be full-width,
# otherwise they terminate the Python string literal.
book_data = {
    "name": "庄子",
    "description": "《庄子》又名《南华经》,是战国中期庄子及其后学所著道家经文。到了汉代以后,尊庄子为南华真人,因此《庄子》亦称《南华经》。其书与《老子》《周易》合称“三玄”。《庄子》书分内、外、杂篇,原有五十二篇,乃由战国中晚期逐步流传、揉杂、附益,至西汉大致成形,然而当时流传版本,今已失传。目前所传三十三篇,已经郭象整理,篇目章节与汉代亦有不同。内篇大体可代表战国时期庄子思想核心,而外、杂篇发展则纵横百余年,参杂黄老、庄子后学形成复杂的体系。",
    "articles": [
        {
            "title": "内篇·逍遥游",
            "content": [
                "北冥有鱼,其名为鲲。鲲之大,不知其几千里也。化而为鸟,其名为鹏。鹏之背,不知其几千里也。怒而飞,其翼若垂天之云。是鸟也,海运则将徙于南冥。南冥者,天池也。",
                "《齐谐》者,志怪者也。《谐》之言曰:“鹏之徙于南冥也,水击三千里,抟扶摇而上者九万里,去以六月息者也。”野马也,尘埃也,生物之以息相吹也。天之苍苍,其正色邪?其远而无所至极邪?其视下也,亦若是则已矣。"
            ]
        },
        {
            "title": "内篇·齐物论",
            "content": [
                "南郭子綦隐机而坐,仰天而嘘,荅焉似丧其耦。颜成子游立侍乎前,曰:“何居乎?形固可使如槁木,而心固可使如死灰乎?今之隐机者,非昔之隐机者也。”",
                "子綦曰:“偃,不亦善乎,而问之也!今者吾丧我,汝知之乎?女闻人籁而未闻地籁,女闻地籁而不闻天籁夫!”"
            ]
        }
    ]
}


class FirefishAPI:
    def __init__(self, base_url, api_key=None, bearer_token=None):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key
        self.bearer_token = bearer_token

    def _headers(self):
        headers = {'Content-Type': 'application/json'}
        if self.bearer_token:
            headers['Authorization'] = f'Bearer {self.bearer_token}'
        return headers

    def _body(self, data):
        if self.api_key:
            data = dict(data or {})
            data['i'] = self.api_key
        return data

    def call(self, path, method='POST', data=None, files=None):
        url = f"{self.base_url}{path}"
        headers = self._headers()
        if files:
            response = requests.post(url, headers=headers, files=files, data=self._body(data))
        else:
            response = requests.request(method, url, headers=headers, json=self._body(data))
        response.raise_for_status()
        try:
            return response.json()
        except Exception:
            return response.text

    def notes_create(self, data):
        return self.call('/notes/create', data=data)


def get_next_article_index():
    try:
        with open('article_index.txt', 'r') as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0


def save_article_index(index):
    with open('article_index.txt', 'w') as f:
        f.write(str(index))


def format_content(content_list):
    full_content = ''.join(content_list)
    first_char = full_content[0]
    formatted_content = f"$[x2 {first_char}]{full_content[1:]}"
    return formatted_content


def send_daily_article():
    api = FirefishAPI(
        base_url="https://si-on.top/api",
        api_key=botkey,
    )
    current_index = get_next_article_index()
    article = book_data["articles"][current_index]
    formatted_content = format_content(article["content"])

    note_text = f"""
<center>**$[font.serif $[fg.color=67B7F7 {article['title']}]]**</center>
<center> $[font.serif {book_data['name']}]</center>

$[font.serif {formatted_content}]

#古文 #{book_data['name']}
"""

    note = api.notes_create({
        "text": note_text,
        "lang": "zh-hans",
    })
    print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} 已发送: {article['title']}")
    next_index = (current_index + 1) % len(book_data["articles"])
    save_article_index(next_index)


def job():
    if datetime.now().hour == 6:
        send_daily_article()


if __name__ == "__main__":
    print("庄子嘟文机器人已启动...")
    print(f"将在每天早上6点发送文章,共{len(book_data['articles'])}篇文章")
    schedule.every().hour.do(job)
    while True:
        schedule.run_pending()
        time.sleep(60)
```