package list

v0.11.2
Published: Feb 13, 2026 License: AGPL-3.0 Imports: 20 Imported by: 0

Documentation

Overview

Package list provides functionality for downloading PDFs from URL lists, CSV files, or TSV files.

The package supports three input formats:

1. Plain text files (.txt or no extension):

  • One URL per line
  • Lines starting with '#' are treated as comments
  • Empty lines are ignored

2. CSV files (.csv):

  • Comma-separated values with automatic column detection
  • Intelligent parsing of paper metadata
  • Generates meaningful filenames from metadata

3. TSV files (.tsv):

  • Tab-separated values with automatic column detection
  • Same features as the CSV format

Column Detection:

The package automatically detects and uses the following columns if present:

  • URL/Link columns: BestLink, BestURL, URL, Link, href (prioritizes "best" variants)
  • DOI column: Automatically converts DOIs to resolvable URLs if no direct URL is found
  • Title column: ArticleTitle, Article_Title, Paper_Title, Title
  • Authors column: Authors, Creator, Contributor
  • Year column: PublicationYear, Publication_Year, Year
  • Journal column: SourceTitle, Source_Title, Journal, Source, Publication
  • Abstract column: Abstract (for future use)
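The detection described above amounts to a case-insensitive scan of the header row, with "best" variants tried first. The following is an illustrative standalone sketch under that assumption; detectColumns and the -1 "absent" sentinel are inventions for the example, not the package's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// ColumnMapping mirrors the package's exported struct; -1 marks an absent column.
type ColumnMapping struct {
	URL, DOI, Title, Authors, Year, Journal, Abstract int
}

// detectColumns walks a priority-ordered candidate list for each field,
// so "best" URL variants win over plain URL/Link columns.
func detectColumns(header []string) ColumnMapping {
	find := func(candidates ...string) int {
		for _, want := range candidates {
			for i, h := range header {
				if strings.EqualFold(strings.TrimSpace(h), want) {
					return i
				}
			}
		}
		return -1
	}
	return ColumnMapping{
		URL:      find("BestLink", "BestURL", "URL", "Link", "href"),
		DOI:      find("DOI"),
		Title:    find("ArticleTitle", "Article_Title", "Paper_Title", "Title"),
		Authors:  find("Authors", "Creator", "Contributor"),
		Year:     find("PublicationYear", "Publication_Year", "Year"),
		Journal:  find("SourceTitle", "Source_Title", "Journal", "Source", "Publication"),
		Abstract: find("Abstract"),
	}
}

func main() {
	m := detectColumns([]string{"ArticleTitle", "Authors", "PublicationYear", "BestLink", "DOI"})
	fmt.Println(m.Title, m.Authors, m.Year, m.URL, m.DOI) // 0 1 2 3 4
}
```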

File Naming:

For CSV/TSV inputs, the package generates intelligent filenames using available metadata:

  • Format: [Year]_[FirstAuthorLastName]_[TruncatedTitle].pdf
  • Example: 2023_Smith_Climate_change_impacts.pdf
  • Falls back to row ID if metadata is insufficient
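A naming scheme like the one above can be sketched as follows. This is a hypothetical reconstruction: buildFilename, the four-word title cutoff, and the "row_" fallback prefix are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var unsafeChars = regexp.MustCompile(`[^A-Za-z0-9_]+`)

// buildFilename assembles [Year]_[FirstAuthorLastName]_[TruncatedTitle].pdf,
// falling back to the row ID when no metadata is available.
func buildFilename(id, year, authors, title string) string {
	var parts []string
	if year != "" {
		parts = append(parts, year)
	}
	if authors != "" {
		// First author's last name: text before the first comma or semicolon.
		first := strings.FieldsFunc(authors, func(r rune) bool { return r == ',' || r == ';' })[0]
		parts = append(parts, strings.TrimSpace(first))
	}
	if title != "" {
		words := strings.Fields(title)
		if len(words) > 4 { // truncate long titles (the cutoff is an assumption)
			words = words[:4]
		}
		parts = append(parts, strings.Join(words, "_"))
	}
	if len(parts) == 0 {
		return "row_" + id + ".pdf" // metadata insufficient: fall back to row ID
	}
	return unsafeChars.ReplaceAllString(strings.Join(parts, "_"), "_") + ".pdf"
}

func main() {
	fmt.Println(buildFilename("7", "2023", "Smith, J.; Jones, M.", "Climate change impacts"))
	fmt.Println(buildFilename("7", "", "", ""))
}
```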

Output:

For CSV/TSV inputs, the package generates a download report with:

  • Original metadata
  • Download success status
  • Generated filename
  • Error messages for failed downloads

The report is saved as [input_filename]_report.csv in the same directory.
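The report naming and structure can be sketched with the standard encoding/csv package. The exact report columns are assumptions; only the [input_filename]_report.csv naming rule comes from the text above.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"path/filepath"
	"strings"
)

// reportPath derives the report name from the input file name,
// e.g. papers.tsv -> papers_report.csv.
func reportPath(input string) string {
	return strings.TrimSuffix(input, filepath.Ext(input)) + "_report.csv"
}

func main() {
	var b strings.Builder
	w := csv.NewWriter(&b)
	// Header: original metadata plus status columns (exact column names are assumptions).
	w.Write([]string{"Title", "Year", "Downloaded", "Filename", "Error"})
	w.Write([]string{"Climate Change Impacts", "2023", "true", "2023_Smith_Climate_change_impacts.pdf", ""})
	w.Write([]string{"Machine Learning Review", "2024", "false", "", "no PDF found"})
	w.Flush()
	fmt.Println(reportPath("papers.tsv"))
	fmt.Print(b.String())
}
```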

Usage:

// Download from plain text URL list
err := list.DownloadURLList("urls.txt")

// Download from CSV with metadata
err := list.DownloadURLList("papers.csv")

// Download from TSV with metadata
err := list.DownloadURLList("papers.tsv")

Example CSV Input:

ArticleTitle,Authors,PublicationYear,BestLink,DOI
"Climate Change Impacts","Smith, J.; Jones, M.",2023,https://example.com/paper1.pdf,10.1234/abc
"Machine Learning Review","Brown, A.",2024,,10.5678/def

The package will:

1. Parse the CSV/TSV structure
2. Detect column mappings automatically
3. Use BestLink when available, or resolve the DOI to a URL
4. Generate a filename like "2023_Smith_Climate_change_impacts.pdf"
5. Download PDFs to the same directory as the input file
6. Create a report showing success or failure for each paper
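The URL-selection step for the example rows above can be sketched as follows. The doi.org resolver is an assumption about how DOIs are turned into resolvable URLs; resolveURL is an illustrative helper, not part of the package's API.

```go
package main

import "fmt"

// resolveURL prefers the direct link; otherwise it converts the DOI
// into a resolvable URL via the doi.org resolver.
func resolveURL(bestLink, doi string) string {
	if bestLink != "" {
		return bestLink
	}
	if doi != "" {
		return "https://doi.org/" + doi
	}
	return ""
}

func main() {
	// Row 1 of the example CSV has a direct link; row 2 only has a DOI.
	fmt.Println(resolveURL("https://example.com/paper1.pdf", "10.1234/abc"))
	fmt.Println(resolveURL("", "10.5678/def"))
}
```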

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DownloadURLList

func DownloadURLList(path string) error

DownloadURLList processes a text, CSV, or TSV file containing URLs and attempts to download PDFs from each entry.

This function supports three input formats:

1. Plain text file: one URL per line
2. CSV file: comma-separated values with intelligent column detection
3. TSV file: tab-separated values with intelligent column detection

For CSV/TSV files, the function automatically detects columns for:

  • URLs (BestLink, BestURL, URL, Link, etc.)
  • DOIs (converted to URLs if no direct URL is available)
  • Title, Authors, Year, Journal (used for intelligent file naming)

The function will:

  • Parse the input file based on its extension (.csv, .tsv, .txt)
  • Generate meaningful filenames using paper metadata when available
  • Create a download report (_report.csv) for CSV/TSV inputs
  • Save all files to the same directory as the input file

Parameters:

  • path: The path to a file containing URLs or paper metadata

Returns an error if the function fails to open or read the input file, but continues processing even if individual URLs fail to download.

Types

type ColumnMapping added in v0.9.2

type ColumnMapping struct {
	URL      int
	DOI      int
	Title    int
	Authors  int
	Year     int
	Journal  int
	Abstract int
}

ColumnMapping holds the detected column indices for relevant fields

type ConcurrentDownloader added in v0.9.3

type ConcurrentDownloader struct {
	// contains filtered or unexported fields
}

ConcurrentDownloader manages concurrent downloads with per-host and global limits

func NewConcurrentDownloader added in v0.9.3

func NewConcurrentDownloader(maxGlobal, maxPerHost int64) *ConcurrentDownloader

NewConcurrentDownloader creates a new concurrent downloader with specified limits
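ConcurrentDownloader's fields are unexported, so its internals are not documented; one common way to implement per-host and global limits is with buffered channels used as counting semaphores. The sketch below shows that pattern under that assumption and is not the package's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// limiter is a counting semaphore built from a buffered channel.
type limiter chan struct{}

func (l limiter) acquire() { l <- struct{}{} }
func (l limiter) release() { <-l }

// ConcurrentDownloader sketch: one global semaphore plus one per host.
type ConcurrentDownloader struct {
	global  limiter
	perHost map[string]limiter
	maxHost int64
	mu      sync.Mutex
}

func NewConcurrentDownloader(maxGlobal, maxPerHost int64) *ConcurrentDownloader {
	return &ConcurrentDownloader{
		global:  make(limiter, maxGlobal),
		perHost: map[string]limiter{},
		maxHost: maxPerHost,
	}
}

// hostLimiter lazily creates the per-host semaphore for a URL's host.
func (d *ConcurrentDownloader) hostLimiter(rawurl string) limiter {
	host := rawurl
	if u, err := url.Parse(rawurl); err == nil {
		host = u.Host
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	l, ok := d.perHost[host]
	if !ok {
		l = make(limiter, d.maxHost)
		d.perHost[host] = l
	}
	return l
}

// Download runs fn while holding both the global and the per-host slot.
func (d *ConcurrentDownloader) Download(rawurl string, fn func()) {
	h := d.hostLimiter(rawurl)
	d.global.acquire()
	h.acquire()
	defer d.global.release()
	defer h.release()
	fn()
}

func main() {
	d := NewConcurrentDownloader(4, 2)
	var wg sync.WaitGroup
	var mu sync.Mutex
	count := 0
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go d.Download("https://example.com/p.pdf", func() {
			defer wg.Done()
			mu.Lock()
			count++
			mu.Unlock()
		})
	}
	wg.Wait()
	fmt.Println(count) // 6
}
```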

type DownloadResult added in v0.9.3

type DownloadResult struct {
	Task    *DownloadTask
	Success bool
	Error   error
}

DownloadResult represents the result of a download attempt

type DownloadTask added in v0.9.3

type DownloadTask struct {
	URL         string
	PDFUrl      string
	Filename    string
	FullPath    string
	Paper       *PaperMetadata
	OriginalURL string
}

DownloadTask represents a download job

type PaperMetadata added in v0.9.2

type PaperMetadata struct {
	ID             string   // Row index or identifier
	URL            string   // Best URL or resolved DOI
	DOI            string   // Digital Object Identifier
	Title          string   // Article title
	Authors        string   // Authors list
	Year           string   // Publication year
	Journal        string   // Source/Journal title
	Abstract       string   // Abstract text (for future use)
	Downloaded     bool     // Whether the file was successfully downloaded
	Filename       string   // Generated filename for the download
	ErrorMsg       string   // Error message if download failed
	OriginalRecord []string // Original CSV/TSV row data for preserving all columns
}

PaperMetadata holds information about a paper to be downloaded

type RetryConfig added in v0.9.3

type RetryConfig struct {
	MaxRetries int
	BaseDelay  time.Duration
	MaxDelay   time.Duration
	Jitter     bool
}

RetryConfig holds retry policy configuration

func DefaultRetryConfig added in v0.9.3

func DefaultRetryConfig() RetryConfig

DefaultRetryConfig returns a sensible default retry configuration

type UnpaywallResponse added in v0.9.3

type UnpaywallResponse struct {
	DOI            string `json:"doi"`
	IsOA           bool   `json:"is_oa"`
	BestOALocation struct {
		HostType  string `json:"host_type"`
		URLForPDF string `json:"url_for_pdf"`
	} `json:"best_oa_location"`
	OALocations []struct {
		HostType  string `json:"host_type"`
		URLForPDF string `json:"url_for_pdf"`
	} `json:"oa_locations"`
}

UnpaywallResponse represents the response from the Unpaywall API.
