Documentation ¶
Overview ¶
Package list provides functionality for downloading PDFs from URL lists, CSV files, or TSV files.
The package supports three input formats:
1. Plain text files (.txt or no extension):
- One URL per line
- Lines starting with '#' are treated as comments
- Empty lines are ignored
2. CSV files (.csv):
- Comma-separated values with automatic column detection
- Intelligent parsing of paper metadata
- Generates meaningful filenames from metadata
3. TSV files (.tsv):
- Tab-separated values with automatic column detection
- Same parsing and naming features as CSV files
Column Detection:
The package automatically detects and uses the following columns if present:
- URL/Link columns: BestLink, BestURL, URL, Link, href (prioritizes "best" variants)
- DOI column: Automatically converts DOIs to resolvable URLs if no direct URL is found
- Title column: ArticleTitle, Article_Title, Paper_Title, Title
- Authors column: Authors, Creator, Contributor
- Year column: PublicationYear, Publication_Year, Year
- Journal column: SourceTitle, Source_Title, Journal, Source, Publication
- Abstract column: Abstract (for future use)
File Naming:
For CSV/TSV inputs, the package generates intelligent filenames using available metadata:
- Format: [Year]_[FirstAuthorLastName]_[TruncatedTitle].pdf
- Example: 2023_Smith_Climate_change_impacts.pdf
- Falls back to row ID if metadata is insufficient
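The naming scheme above can be sketched as a small helper. This is an illustrative reconstruction, not the package's actual code; `buildFilename` and its truncation length are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildFilename sketches the documented naming scheme
// [Year]_[FirstAuthorLastName]_[TruncatedTitle].pdf, falling back to the
// row ID when metadata is insufficient. Hypothetical helper, not the API.
func buildFilename(id, year, authors, title string) string {
	if year == "" || authors == "" || title == "" {
		return id + ".pdf" // fall back to row ID
	}
	// First author's last name: text before the first comma or semicolon.
	parts := strings.FieldsFunc(authors, func(r rune) bool { return r == ',' || r == ';' })
	last := id
	if len(parts) > 0 {
		last = strings.TrimSpace(parts[0])
	}
	// Replace non-alphanumeric runs in the title with underscores, then truncate.
	clean := regexp.MustCompile(`[^A-Za-z0-9]+`).ReplaceAllString(title, "_")
	clean = strings.Trim(clean, "_")
	if len(clean) > 40 {
		clean = clean[:40]
	}
	return fmt.Sprintf("%s_%s_%s.pdf", year, last, clean)
}

func main() {
	fmt.Println(buildFilename("row7", "2023", "Smith, J.; Jones, M.", "Climate change impacts"))
	// 2023_Smith_Climate_change_impacts.pdf
	fmt.Println(buildFilename("row8", "", "", "")) // row8.pdf
}
```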
Output:
For CSV/TSV inputs, the package generates a download report with:
- Original metadata
- Download success status
- Generated filename
- Error messages for failed downloads
The report is saved as [input_filename]_report.csv in the same directory.
Usage:
// Download from plain text URL list
err := list.DownloadURLList("urls.txt")
// Download from CSV with metadata
err := list.DownloadURLList("papers.csv")
// Download from TSV with metadata
err := list.DownloadURLList("papers.tsv")
Example CSV Input:
ArticleTitle,Authors,PublicationYear,BestLink,DOI
"Climate Change Impacts","Smith, J.; Jones, M.",2023,https://example.com/paper1.pdf,10.1234/abc
"Machine Learning Review","Brown, A.",2024,,10.5678/def
The package will:
1. Parse the CSV/TSV structure
2. Detect column mappings automatically
3. Use BestLink when available, or resolve DOI to URL
4. Generate filename like "2023_Smith_Climate_Change_Impacts.pdf"
5. Download PDFs to the same directory as the input file
6. Create a report showing success/failure for each paper
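The DOI fallback mentioned above can be sketched with the standard doi.org resolver. The function name is illustrative, not part of the package's API.

```go
package main

import (
	"fmt"
	"strings"
)

// doiToURL turns a bare DOI such as "10.5678/def" into a resolvable link
// via the doi.org resolver. Hypothetical helper for illustration only.
func doiToURL(doi string) string {
	doi = strings.TrimSpace(doi)
	// Leave DOIs that already carry a resolver prefix untouched.
	if strings.HasPrefix(doi, "http://") || strings.HasPrefix(doi, "https://") {
		return doi
	}
	return "https://doi.org/" + doi
}

func main() {
	fmt.Println(doiToURL("10.5678/def")) // https://doi.org/10.5678/def
}
```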
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DownloadURLList ¶
DownloadURLList processes a text, CSV, or TSV file containing URLs and attempts to download PDFs from each entry.
This function supports three input formats:
 1. Plain text file: One URL per line
 2. CSV file: Comma-separated values with intelligent column detection
 3. TSV file: Tab-separated values with intelligent column detection
For CSV/TSV files, the function automatically detects columns for:
 - URLs (BestLink, BestURL, URL, Link, etc.)
 - DOIs (converted to URLs if no direct URL is available)
 - Title, Authors, Year, Journal (used for intelligent file naming)
The function will:
 - Parse the input file based on its extension (.csv, .tsv, .txt)
 - Generate meaningful filenames using paper metadata when available
 - Create a download report (_report.csv) for CSV/TSV inputs
 - Save all files to the same directory as the input file
Parameters:
- path: The path to a file containing URLs or paper metadata
Returns an error if the function fails to open or read the input file, but continues processing even if individual URLs fail to download.
Types ¶
type ColumnMapping ¶ added in v0.9.2
type ColumnMapping struct {
URL int
DOI int
Title int
Authors int
Year int
Journal int
Abstract int
}
ColumnMapping holds the detected column indices for relevant fields
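The header scan behind ColumnMapping can be sketched as follows. This is an illustrative reconstruction of the documented behavior (case-insensitive matching, "Best" variants prioritized, -1 for a missing column); `detectColumns` is not the package's actual function.

```go
package main

import (
	"fmt"
	"strings"
)

// detectColumns maps a header row onto column indices for a few of the
// documented fields. Candidate names are ordered so "Best" URL variants
// win; -1 marks a column that was not found. Hypothetical sketch.
func detectColumns(header []string) map[string]int {
	candidates := map[string][]string{
		"URL":   {"bestlink", "besturl", "url", "link", "href"},
		"DOI":   {"doi"},
		"Title": {"articletitle", "article_title", "paper_title", "title"},
		"Year":  {"publicationyear", "publication_year", "year"},
	}
	out := map[string]int{}
	for field, names := range candidates {
		out[field] = -1
		for _, name := range names {
			for i, h := range header {
				if strings.ToLower(strings.TrimSpace(h)) == name {
					out[field] = i
					break
				}
			}
			if out[field] != -1 {
				break // keep the highest-priority match
			}
		}
	}
	return out
}

func main() {
	m := detectColumns([]string{"ArticleTitle", "Authors", "PublicationYear", "BestLink", "DOI"})
	fmt.Println(m["URL"], m["DOI"], m["Title"], m["Year"]) // 3 4 0 2
}
```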
type ConcurrentDownloader ¶ added in v0.9.3
type ConcurrentDownloader struct {
// contains filtered or unexported fields
}
ConcurrentDownloader manages concurrent downloads with per-host and global limits
func NewConcurrentDownloader ¶ added in v0.9.3
func NewConcurrentDownloader(maxGlobal, maxPerHost int64) *ConcurrentDownloader
NewConcurrentDownloader creates a new concurrent downloader with specified limits
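ConcurrentDownloader's fields are unexported, so the mechanism below is a guess at what "per-host and global limits" could look like, using buffered channels as counting semaphores. It is a sketch under those assumptions, not the package's implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// limiter sketches combined global and per-host concurrency limits.
// Each buffered channel acts as a counting semaphore: sending blocks
// once the buffer (the limit) is full.
type limiter struct {
	global  chan struct{}
	mu      sync.Mutex
	perHost map[string]chan struct{}
	maxHost int
}

func newLimiter(maxGlobal, maxPerHost int) *limiter {
	return &limiter{
		global:  make(chan struct{}, maxGlobal),
		perHost: make(map[string]chan struct{}),
		maxHost: maxPerHost,
	}
}

// acquire blocks until both a global and a per-host slot are free, then
// returns a release function that frees both slots.
func (l *limiter) acquire(rawURL string) (release func(), err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	l.mu.Lock()
	host, ok := l.perHost[u.Host]
	if !ok {
		host = make(chan struct{}, l.maxHost)
		l.perHost[u.Host] = host
	}
	l.mu.Unlock()

	l.global <- struct{}{} // blocks when maxGlobal downloads are in flight
	host <- struct{}{}     // blocks when maxPerHost downloads hit this host
	return func() { <-host; <-l.global }, nil
}

func main() {
	l := newLimiter(4, 2)
	release, err := l.acquire("https://example.com/paper1.pdf")
	if err != nil {
		panic(err)
	}
	fmt.Println("acquired slot for example.com")
	release()
}
```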
type DownloadResult ¶ added in v0.9.3
type DownloadResult struct {
Task *DownloadTask
Success bool
Error error
}
DownloadResult represents the result of a download attempt
type DownloadTask ¶ added in v0.9.3
type DownloadTask struct {
URL string
PDFUrl string
Filename string
FullPath string
Paper *PaperMetadata
OriginalURL string
}
DownloadTask represents a download job
type PaperMetadata ¶ added in v0.9.2
type PaperMetadata struct {
ID string // Row index or identifier
URL string // Best URL or resolved DOI
DOI string // Digital Object Identifier
Title string // Article title
Authors string // Authors list
Year string // Publication year
Journal string // Source/Journal title
Abstract string // Abstract text (for future use)
Downloaded bool // Whether the file was successfully downloaded
Filename string // Generated filename for the download
ErrorMsg string // Error message if download failed
OriginalRecord []string // Original CSV/TSV row data for preserving all columns
}
PaperMetadata holds information about a paper to be downloaded
type RetryConfig ¶ added in v0.9.3
type RetryConfig struct {
MaxRetries int
BaseDelay time.Duration
MaxDelay time.Duration
Jitter bool
}
RetryConfig holds retry policy configuration
func DefaultRetryConfig ¶ added in v0.9.3
func DefaultRetryConfig() RetryConfig
DefaultRetryConfig returns a sensible default retry configuration
type UnpaywallResponse ¶ added in v0.9.3
type UnpaywallResponse struct {
DOI string `json:"doi"`
IsOA bool `json:"is_oa"`
BestOALocation struct {
HostType string `json:"host_type"`
URLForPDF string `json:"url_for_pdf"`
} `json:"best_oa_location"`
OALocations []struct {
HostType string `json:"host_type"`
URLForPDF string `json:"url_for_pdf"`
} `json:"oa_locations"`
}
UnpaywallResponse represents the response from the Unpaywall API