Data Awesome
- douban
- Business registration data
- microsoft/Data-Science-For-Beginners
Crawler
- Scrapy
- crawlab-team/crawlab Distributed crawler management platform that supports any language and framework
- MontFerret/ferret
- Declarative web scraping
- Go
- BruceDone/awesome-crawler
- References
ETL Pipeline
- Apache NiFi
- singer.io
- transferwise/pipelinewise
- Apache-2.0, Python
- Taps are licensed under AGPL-3.0
- Data Pipeline Framework using the singer.io spec
- The postgres tap is based on wal2json
- apache/arrow-datafusion
- dask/dask Parallel computing with task scheduling
- airbytehq/airbyte
- MIT, Java+Python+TypeScript
- nuclio/nuclio Serverless event and data processing platform
- Apache-2.0, Go
- pditommaso/awesome-pipeline
- rudderlabs/rudder-server
- AGPL-3.0, Go, TS, React
- Segment-alternative
- Backend: PostgreSQL
- Customer Data Platform, CDP
ML Pipeline
- flyteorg/flyte Kubernetes-native workflow automation platform - Machine Learning & Data Processing
- polyaxon/polyaxon Machine Learning Platform for Kubernetes
Workflow
Archive
Wayback Machine
Dataset
File Format
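- Parquet - columnar format
- High compression ratio and storage efficiency
- Supports nested data structures
- Avro - row-based format
- Schema is included, defined in JSON
- Data is stored as binary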
- https://zymeworks.github.io/avro-viewer/
- ORC - Optimized Row Columnar
- Stores additional index information
- Arrow - in-memory format
- Mainly used for processing
- CSV, TSV
- JSON
- JSONL - .jsonl, .ndjson
- One JSON value per line
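
The difference between the row-based and columnar formats listed above can be sketched with the standard library alone. The example below writes records as JSONL (one JSON value per line, a row-oriented layout) and then pivots them into one list per field, which is roughly how columnar formats such as Parquet and ORC group values. The helper names `dump_jsonl`, `load_jsonl`, and `to_columns` and the sample records are made up for illustration. This is not how any of these formats is actually encoded on disk.

```python
import io
import json

# Hypothetical sample records, used only for illustration.
records = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.0},
]


def dump_jsonl(rows, fp):
    """Write one compact JSON object per line (row-oriented)."""
    for row in rows:
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_jsonl(fp):
    """Parse every non-empty line as an independent JSON value."""
    return [json.loads(line) for line in fp if line.strip()]


def to_columns(rows):
    """Pivot rows into a column-oriented dict: one list per field."""
    columns = {}
    for i, row in enumerate(rows):
        for key, value in row.items():
            # A field first seen in row i is back-filled with None.
            columns.setdefault(key, [None] * i).append(value)
        # Pad fields that are missing from this row.
        for col in columns.values():
            if len(col) <= i:
                col.append(None)
    return columns


buf = io.StringIO()
dump_jsonl(records, buf)
jsonl_text = buf.getvalue()

rows = load_jsonl(io.StringIO(jsonl_text))
columns = to_columns(rows)
```

Because every JSONL line is parsed independently, a file can be appended to or processed as a stream. The columnar view keeps values of the same field next to each other, which is one reason columnar formats tend to compress well.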
Misc
- facebookresearch/AugLy data augmentations library for audio, image, text, and video
- Profil3r – OSINT Tool To Find Social Media Profiles & Their Email Addresses
- Open-source intelligence
- ml874/Data-Science-Cheatsheet
- looker-open-source/malloy
- thalo-rs/thalo
- Event sourcing suite for Rust
Tools
- TomWright/dasel
- MIT, Go
- JSON, TOML, YAML, XML, CSV
- johnkerl/miller
- MIT, Go
- awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
- simeji/jid
- MIT, Go
- json incremental digger
- fiatjaf/jiq
- jid in jq syntax
- tomnomnom/gron
- MIT, Go
- Makes JSON greppable
- saulpw/visidata
- GPLv3, Python
- terminal spreadsheet
- multiprocessio/dsq
- Apache-2.0, Go
- SQL for JSON, CSV, Excel, Parquet
- x2bool/xlite
- MIT, Rust
- SQLite for .xlsx, .xls, .ods
- jq
- dbcrossbar/dbcrossbar
- mdb
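
The idea behind making JSON greppable can be sketched as follows: flatten a nested document into one assignment per line, so that each line carries its full path and plain text tools can filter it. This is a minimal Python sketch loosely modelled on the output style of tomnomnom/gron. The function name `flatten` and the sample document are hypothetical, and the exact formatting of the real tool may differ.

```python
import json
import re

# Keys that look like identifiers use dot notation; others use brackets.
IDENT = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def flatten(value, path="json"):
    """Yield one 'path = value;' line per node, parents before children."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            if IDENT.fullmatch(key):
                sub = f"{path}.{key}"
            else:
                sub = f"{path}[{json.dumps(key)}]"
            yield from flatten(child, sub)
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # Scalars are rendered with their JSON spelling (true, null, "text").
        yield f"{path} = {json.dumps(value)};"


# Hypothetical sample document, used only for illustration.
doc = json.loads(
    '{"name": "example", "tags": ["a", "b"], '
    '"meta": {"ok": true, "open source": null}}'
)
lines = list(flatten(doc))

# Filtering by path is now a plain substring match, as grep would do.
tag_lines = [line for line in lines if line.startswith("json.tags[")]
```

Here `tag_lines` holds the two lines for `json.tags[0]` and `json.tags[1]`, each with its full path, which is what makes line-based filtering practical.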